Simon Willison’s Weblog: Hallucinations in code are the least dangerous form of LLM mistakes

Source URL: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/#atom-everything
Source: Simon Willison’s Weblog
Title: Hallucinations in code are the least dangerous form of LLM mistakes

Feedly Summary: A surprisingly common complaint I see from developers who have tried using LLMs for code is that they encountered a hallucination – usually the LLM inventing a method or even a full software library that doesn’t exist – and it crashed their confidence in LLMs as a tool for writing code. How could anyone productively use these things if they invent methods that don’t exist?
Hallucinations in code are the least harmful hallucinations you can encounter from a model.
The moment you run that code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.
Compare this to hallucinations in regular prose, where you need a critical eye, strong intuitions and well developed fact checking skills to avoid sharing information that’s incorrect and directly harmful to your reputation.
With code you get a powerful form of fact checking for free. Run the code, see if it works.
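As a concrete illustration, here is roughly what that looks like in practice. The `summarize()` method below is invented (exactly the kind of thing a model might hallucinate), and the failure is loud and immediate:

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 7, 9]})

# A hallucinated method fails the moment you run the code:
df.summarize()
# AttributeError: 'DataFrame' object has no attribute 'summarize'
```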
In some setups – ChatGPT Code Interpreter, Claude Code, any of the growing number of “agentic” code systems that write and then execute code in a loop – the LLM system itself will spot the error and automatically correct itself.
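A minimal sketch of that write-run-fix loop, assuming a placeholder `generate_code()` function standing in for whichever model API you use:

```python
import subprocess
import tempfile

def generate_code(prompt: str) -> str:
    """Placeholder for a call to your model of choice."""
    raise NotImplementedError

def write_run_fix(task: str, max_attempts: int = 3) -> str:
    prompt = task
    for _ in range(max_attempts):
        code = generate_code(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        result = subprocess.run(["python", f.name], capture_output=True, text=True)
        if result.returncode == 0:
            return code  # runs cleanly; still needs human review
        # Feed the error straight back so the model can correct itself
        prompt = f"{task}\n\nThe previous attempt failed with:\n{result.stderr}\nFix it."
    raise RuntimeError("no runnable code after several attempts")
```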
If you’re using an LLM to write code without even running it yourself, what are you doing?
Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems – they dropped them at the first hurdle.
My cynical side suspects they may have been looking for a reason to dismiss the technology and jumped at the first one they found.
My less cynical side assumes that nobody ever warned them that you have to put a lot of work in to learn how to get good results out of these systems. I’ve been exploring their applications for writing code for over two years now and I’m still learning new tricks (and new strengths and weaknesses) almost every day.
The real risk from using LLMs for code is that they’ll make mistakes that aren’t instantly caught by the language compiler or interpreter. And these happen all the time!
Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review – or even comprehensive automated tests – will demonstrably prove that code actually does the right thing. You have to run it yourself!
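Here is a contrived example of that failure mode: it looks tidy, runs without errors, and quietly gives a wrong answer for a case the author never exercised.

```python
def average_rating(ratings: list[float]) -> float:
    """Return the mean rating for a product."""
    if not ratings:
        return 5.0  # plausible-looking default that silently inflates unrated products
    return sum(ratings) / len(ratings)

# No crash, no exception - just a wrong answer you only catch by running it:
print(average_rating([]))  # 5.0
```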
Proving to yourself that the code works is your job. This is one of the many reasons I don’t think LLMs are going to put software professionals out of work.
LLM code will usually look fantastic: good variable names, convincing comments, clear type annotations and a logical structure. This can lull you into a false sense of security, in the same way that a grammatically correct and confident answer from ChatGPT might tempt you to skip fact checking or applying a skeptical eye.
The way to avoid those problems is the same as how you avoid problems in code by other humans that you are reviewing, or code that you’ve written yourself: you need to actively exercise that code. You need to have great manual QA skills.
A general rule for programming is that you should never trust any piece of code until you’ve seen it work with your own eyes – or, even better, seen it fail and then fixed it.
Across my entire career, almost every time I’ve assumed some code works without actively executing it – some branch condition that rarely gets hit, or an error message that I don’t expect to occur – I’ve later come to regret that assumption.
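One cheap way to act on that lesson is to force the rarely-hit branch deliberately instead of assuming it works. A minimal sketch using pytest:

```python
import json
import pytest

def load_config(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        raise SystemExit(f"Config file not found: {path}")

# Exercise the error branch on purpose rather than trusting it:
def test_missing_config_exits_with_message():
    with pytest.raises(SystemExit, match="Config file not found"):
        load_config("/nonexistent/config.json")
```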
If you really are seeing a deluge of hallucinated details in the code LLMs are producing for you, there are a bunch of things you can do about it.

Try different models. It might be that another model has better training data for your chosen platform. As a Python and JavaScript programmer my favorite models right now are Claude 3.7 Sonnet with thinking turned on, OpenAI’s o3-mini-high and GPT-4o with Code Interpreter (for Python).
Learn how to use the context. If an LLM doesn’t know a particular library you can often fix this by dumping in a few dozen lines of example code (see the sketch after this list). LLMs are incredibly good at imitating things, and at rapidly picking up patterns from very limited examples. Modern models have increasingly large context windows – I’ve recently started using Claude’s new GitHub integration to dump entire repositories into the context and it’s been working extremely well for me.
Choose boring technology. I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.
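A rough sketch of what “dumping example code into the context” can look like in practice. The file paths here are purely illustrative, and the resulting prompt goes to whichever model you prefer:

```python
from pathlib import Path

def build_prompt(task: str, example_files: list[str]) -> str:
    """Prepend real usage examples so the model imitates the library's
    actual API instead of inventing one."""
    examples = "\n\n".join(
        f"# {path}\n{Path(path).read_text()}" for path in example_files
    )
    return (
        "Here is working example code for the library I'm using:\n\n"
        f"{examples}\n\n"
        f"Using the same patterns, {task}."
    )

# Illustrative usage (the paths are hypothetical):
# prompt = build_prompt(
#     "write a function that exports the monthly report as CSV",
#     ["examples/basic_usage.py", "examples/export_pdf.py"],
# )
```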

I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
Tags: ai, openai, generative-ai, llms, ai-assisted-programming, anthropic, claude, code-interpreter, ai-agents

AI Summary and Description: Yes

Summary: The text discusses the challenges developers face when using Large Language Models (LLMs) for coding, particularly concerning “hallucinations” where the LLM generates non-existent methods. It emphasizes the importance of fact-checking and manual quality assurance, suggesting that developers should actively run and review code generated by LLMs instead of relying solely on them. The insights highlight the significance of well-developed coding and review skills in leveraging AI tools effectively.

Detailed Description:

The provided text elaborates on a critical issue surrounding the use of LLMs in programming, particularly their tendency to produce “hallucinations” – invented methods or libraries that do not actually exist. The implications for security and compliance professionals include a need for diligence and proactive verification in AI-assisted coding environments. Key points include:

– **Understanding Hallucinations**:
  – Developers often encounter hallucinations when LLMs invent methods that do not exist, which can lead to a lack of trust in LLMs as coding tools.
  – Unlike prose hallucinations, code hallucinations can be immediately identified through testing, providing a layer of real-time fact-checking.

– **Fact-checking Code**:
  – Running generated code is a necessary step to ensure its functionality, as hallucinated methods lead to errors that are easily caught through execution.
  – Tools like ChatGPT Code Interpreter and other agentic systems can identify and correct these errors automatically.

– **Importance of Manual Review**:
  – Even if code runs without errors, it is essential to verify that it behaves as intended. This mitigates the risk associated with LLM-generated code, which might superficially appear correct.
  – Strong manual QA skills help prevent the pitfalls of assuming the correctness of any generated code.

– **Skill Development for Developers**:
  – Developers who dismiss LLMs after encountering hallucinations may not have invested sufficient time in learning how to use these technologies effectively.
  – Continuous learning and stronger skills in code review and critical thinking are necessary to get good results from AI coding tools.

– **Practical Recommendations**:
  – Try different models, since some have better training data for a given platform, to reduce the risk of hallucinations.
  – Enhance the context supplied to LLMs by sharing relevant example code to improve output accuracy.
  – Opt for well-established libraries that are more likely to be well represented in LLM training data.

– **Conclusion**:
  – Rather than dismissing LLMs at the first hallucination, the text suggests building a strong foundation in understanding and reviewing both human- and AI-generated code.
  – Reviewing LLM-generated code can itself enhance one’s programming skills, positioning LLMs as tools that complement rather than replace software professionals.

Overall, the text provides nuanced insights into the interaction between software development and AI tools, underscoring the need for a balanced approach that combines AI capabilities with human oversight and skill development in the field of coding security.