While AI models perform better when paired with debugging tools, their overall success rate remains too low to fully replace human coders—especially in debugging tasks—according to a new study.
Microsoft Research evaluated several AI agents on the SWE-bench benchmark. The results showed that debugging tools significantly boosted success rates: Claude 3.7 Sonnet, for example, achieved a 48.4% success rate with debugging, up from 37.2% without, while OpenAI's models also improved, with o3-mini jumping from 8.5% to 22.1%, a 160% relative increase. Still, none of the models reached performance levels that would make them reliable stand-ins for human developers.
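For readers checking the arithmetic, the 160% figure is a relative rather than absolute gain. A quick sketch using the percentages quoted above:

```python
# Relative improvement = (with_debugger - without_debugger) / without_debugger
# Figures are the o3-mini success rates quoted in the study.
without_debugger = 8.5   # % success rate, no debugging tools
with_debugger = 22.1     # % success rate, with debugging tools

relative_gain = (with_debugger - without_debugger) / without_debugger
print(f"Relative improvement: {relative_gain:.0%}")  # prints ~160%
```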
The study suggests that current AI models struggle in part because their training data isn’t well suited to sequential decision-making tasks like debugging. Moreover, these tools don’t yet fully understand how to optimally use the debugging information provided.
The report emphasizes that this is just the beginning. The next step involves developing more refined “info-seeking models” that are better at gathering relevant information to solve bugs. In cases where using large models incurs high computational costs, smaller models could be used to gather essential details before handing the task off to a larger AI system.
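As an illustration of that handoff idea, here is a minimal sketch; the report does not prescribe an implementation, so `call_small_model` and `call_large_model` are hypothetical stand-ins for whatever LLM APIs would actually be used. A cheaper model condenses the debugger output first, and only the summary is passed to the larger, more expensive model.

```python
# Hypothetical sketch of the small-model/large-model handoff described above.
# The two call_* functions are assumptions, not part of the study.

def call_small_model(prompt: str) -> str:
    """Cheap model: summarizes debugger output and gathers context."""
    raise NotImplementedError  # plug in a real LLM call here

def call_large_model(prompt: str) -> str:
    """Expensive model: proposes the actual patch."""
    raise NotImplementedError  # plug in a real LLM call here

def debug_with_handoff(failing_test_log: str, debugger_trace: str) -> str:
    # Step 1: the small model condenses the raw debugging information.
    summary = call_small_model(
        "Summarize the relevant stack frames, variable values, and likely "
        f"fault location from this debugging session:\n{debugger_trace}\n"
        f"Failing test output:\n{failing_test_log}"
    )
    # Step 2: only the condensed context goes to the larger model,
    # keeping the expensive call short and focused.
    patch = call_large_model(
        f"Given this debugging summary, propose a code fix:\n{summary}"
    )
    return patch
```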
This isn’t the first time AI’s limitations have been highlighted. While AI tools can sometimes generate seemingly functional code for narrow use cases, they often introduce bugs and security flaws—and typically lack the capability to fix them.
Researchers agree that the future of AI coding agents lies in tools that assist developers rather than replace them. The most realistic goal for now is to build agents that save developers significant time, not ones that can independently handle all aspects of software development.