Source URL: https://simonwillison.net/2025/Mar/31/debug-gym/#atom-everything
Source: Simon Willison’s Weblog
Title: debug-gym
Feedly Summary: debug-gym
New paper and code from Microsoft Research that experiments with giving LLMs access to the Python debugger. They found that the best models could indeed improve their results by running pdb as a tool.
They saw the best results overall from Claude 3.7 Sonnet against SWE-bench Lite, where it scored 37.2% in rewrite mode without a debugger, 48.4% with their debugger tool and 52.1% with debug(5) – a mechanism where the pdb tool is made available only after the 5th rewrite attempt.
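The debug(5) gating is straightforward to picture. Here is a minimal sketch of the idea in Python; the names and structure are hypothetical, not taken from the debug-gym codebase:

```python
# Hypothetical sketch of the debug(5) idea: the pdb tool only becomes
# visible to the model after `threshold` rewrite attempts have been used.
# Names are illustrative, not from the debug-gym codebase.

def available_tools(rewrite_attempts: int, threshold: int = 5) -> list[str]:
    tools = ["rewrite"]            # the model can always propose a patch
    if rewrite_attempts >= threshold:
        tools.append("pdb")        # the debugger unlocks after the threshold
    return tools

# debug(5): pdb appears only once five rewrites have been tried.
assert available_tools(rewrite_attempts=4) == ["rewrite"]
assert available_tools(rewrite_attempts=5) == ["rewrite", "pdb"]
```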
Their code is available on GitHub. I found this implementation of the pdb tool, and tracked down the main system and user prompt in agents/debug_agent.py:
System prompt:
Your goal is to debug a Python program to make sure it can pass a set of test functions. You have access to the pdb debugger tools, you can use them to investigate the code, set breakpoints, and print necessary values to identify the bugs. Once you have gained enough information, propose a rewriting patch to fix the bugs. Avoid rewriting the entire code, focus on the bugs only.
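The pdb access itself can be provided by driving `python -m pdb` as a subprocess and feeding it commands. This is a minimal sketch of that approach under my own assumptions; it is not the debug-gym implementation linked above:

```python
import subprocess
import sys

def run_pdb(script: str, commands: list[str], timeout: int = 30) -> str:
    """Run a script under pdb, feeding it a fixed list of commands.

    Non-interactive sketch: every command (plus a final `quit`) is written
    to stdin up front, and the full transcript comes back so the caller
    can read breakpoint hits and printed values from the output.
    """
    stdin = "\n".join(commands + ["quit"]) + "\n"
    proc = subprocess.run(
        [sys.executable, "-m", "pdb", script],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout

# e.g. set a breakpoint, continue to it, then print a variable
# (assumes a local buggy.py exists, purely for illustration):
print(run_pdb("buggy.py", ["b buggy.py:12", "c", "p result"]))
```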
User prompt (which they call an “action prompt”):
Based on the instruction, the current code, the last execution output, and the history information, continue your debugging process using pdb commands or to propose a patch using rewrite command. Output a single command, nothing else. Do not repeat your previous commands unless they can provide more information. You must be concise and avoid overthinking.
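That "output a single command, nothing else" constraint implies a simple read-eval loop around the model. Here is a rough, self-contained sketch of the shape of such a loop; every name in it is a hypothetical stand-in, not something from agents/debug_agent.py:

```python
# Rough sketch of the loop implied by the action prompt. All callables
# here are hypothetical stand-ins, not debug-gym functions.

from collections.abc import Callable, Iterator

def debug_loop(
    next_command: Callable[[str, list[str]], str],  # stands in for the LLM call
    run_pdb: Callable[[str], str],                  # stands in for the pdb tool
    apply_rewrite: Callable[[str], bool],           # applies a patch, returns tests-pass
    max_steps: int = 30,
) -> bool:
    history: list[str] = []
    last_output = ""
    for _ in range(max_steps):
        # The model sees the last execution output plus the history
        # and must reply with exactly one command.
        command = next_command(last_output, history)
        history.append(command)
        if command.startswith("rewrite"):
            if apply_rewrite(command):
                return True                    # patch made the tests pass
            last_output = "tests failed"
        else:
            last_output = run_pdb(command)     # e.g. "b foo.py:12", then "p x"
    return False

# Toy run with canned "model" output standing in for a real LLM:
script: Iterator[str] = iter(["p x", "rewrite def f(): return x + 1"])
ok = debug_loop(
    next_command=lambda out, hist: next(script),
    run_pdb=lambda cmd: "x = 41",
    apply_rewrite=lambda patch: True,
)
print("fixed:", ok)
```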
Via Import AI
Tags: prompt-engineering, llms, python, generative-ai, llm-tool-use, ai, microsoft, claude
AI Summary and Description: Yes
Summary: The text discusses a recent development from Microsoft Research that gives Large Language Models (LLMs) access to the Python debugger (pdb) to improve their debugging ability. Performance improved notably when the debugger was available, with implications for how LLMs are applied in software development and security.
Detailed Description:
The provided text outlines an innovative approach by Microsoft Research that integrates debugging capabilities into LLMs using the Python debugger (pdb). Here are the major points derived from the text:
– **Objective**: Enhance the debugging of Python code by giving LLMs access to debugging tools, so a model can investigate a failure before proposing a fix. This has direct applications in software development and security tooling.
– **Results**:
  – Claude 3.7 Sonnet showed the largest improvement, measured on the SWE-bench Lite benchmark.
  – Performance scores:
    – **Rewrite mode without debugger**: 37.2%
    – **With debugger tool**: 48.4%
    – **With debug(5) mechanism**: 52.1% (the pdb tool made available only after the fifth rewrite attempt)
– **Implementation**:
  – The system and user prompts guide the LLM through the debugging process:
    – **System Prompt**: Instructs the LLM to use the pdb tools to investigate the code and identify the bugs, then propose a patch focused on the bugs rather than a full rewrite.
    – **User Prompt**: Directs the LLM to emit exactly one command per turn, either a pdb command or a rewrite, and to keep each step concise.
– **Availability of Code**: The code related to this research is publicly available on GitHub, promoting transparency and collaboration within the developer community.
– **Tags**: The text includes several tags related to the context, such as prompt-engineering, LLMs, and generative AI, which emphasize the relevance of this research in the fields of AI and software security.
This development is salient for professionals in AI and software security: it is a step towards greater automation of debugging, with the potential to reduce human error and improve code reliability, and it suggests that pairing language models with standard developer tools is a productive direction for software development environments.