Source URL: https://www.theregister.com/2025/03/19/llms_buggy_code/
Source: The Register
Title: Show top LLMs buggy code and they’ll finish off the mistakes rather than fix them
Feedly Summary: One more time, with feeling … Garbage in, garbage out, in training and inference
Researchers have found that large language models (LLMs) tend to parrot buggy code when tasked with completing flawed snippets.…
AI Summary and Description: Yes
Summary: This text presents research findings on the limitations of large language models (LLMs) in coding tasks, notably their tendency to replicate existing bugs when completing code snippets. The study underscores the need to improve LLMs’ grasp of programming syntax and their ability to detect errors.
Detailed Description:
The research, conducted by a group of scientists, analyzed the performance of seven LLMs on bug-prone code snippets, probing how reliably they generate correct code. Key insights from this analysis include:
– **Bug Replication**: The study shows that LLMs often repeat existing bugs rather than correct them (illustrated in the first sketch after this list), a significant limitation for their use in software development.
– **Tested Models**: The seven models were OpenAI’s GPT-4o, GPT-4 (whose figures appear below), and GPT-3.5, Meta’s CodeLlama-13B-hf, Google’s Gemma-7B, BigCode’s StarCoder2-15B, and Salesforce’s CodeGEN-350M. Each achieved varying degrees of success in handling buggy code.
– **Error Rate**: On bug-prone tasks, the models were almost as likely to produce a buggy completion as a correct one, and their accuracy fell well below that on normal code (e.g., GPT-4 had a success rate of 12.27% in buggy contexts versus 29.85% in normal scenarios).
– **Bug Memorization**: The models reproduced historical bugs to a significant degree: on average, 44.44% of the bugs they generated were identical to known past issues. Memorization ranged from 15% to 83% across models, with GPT-4o exhibiting the highest rate (a minimal scoring sketch follows this list).
– **Types of Errors**: The LLMs struggled more with complex coding constructs, such as method invocations and return statements, than with simpler syntax.
– **Recommendations for Improvement**: The researchers suggest that a stronger grasp of programming syntax and semantics, more robust error-detection mechanisms, and tighter integration with development tools could significantly enhance the models’ reliability.
– **Conclusion**: The findings highlight the need for continuous improvement in LLM technology, particularly for software development applications where accuracy is critical.
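To make the bug-replication finding concrete, here is a hypothetical illustration (the snippet is invented for this summary, not taken from the study): given a prefix that already contains an off-by-one bug, a completion model that pattern-matches on the surrounding code will tend to extend the flawed pattern rather than flag or repair it.

```python
# Hypothetical completion prompt (invented example, not from the study).
# The prefix handed to the model already contains an off-by-one bug.

def max_price(prices):
    best = prices[0]
    # BUG in the prefix: "+ 1" lets i run one step past the last valid index.
    for i in range(1, len(prices) + 1):
        # --- the model is asked to complete from here ---
        # A bug-replicating completion reuses the flawed index pattern,
        # so the off-by-one surfaces as an IndexError when i == len(prices):
        if prices[i] > best:
            best = prices[i]
    return best

# A bug-fixing completion would instead tighten the loop bound:
#   for i in range(1, len(prices)):
```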
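The article does not describe the evaluation pipeline in detail, but the two headline numbers (completion accuracy and the share of buggy completions that exactly match a historical bug) can be reproduced with a simple tally. A minimal sketch, assuming each task records the model's completion, the known-correct line, and the historical buggy line; all names and fields here are illustrative, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class CompletionResult:
    """One completion task; the fields are assumptions, not the paper's schema."""
    completion: str       # the line the model produced
    correct_line: str     # the fixed line from the project's history
    historical_bug: str   # the buggy line developers originally wrote

def score(results: list[CompletionResult]) -> dict[str, float]:
    """Tally accuracy, plus the share of wrong completions that exactly
    reproduce the known historical bug (the study's 'memorization')."""
    correct = sum(r.completion.strip() == r.correct_line.strip() for r in results)
    wrong = [r for r in results if r.completion.strip() != r.correct_line.strip()]
    memorized = sum(r.completion.strip() == r.historical_bug.strip() for r in wrong)
    return {
        "accuracy": correct / len(results),
        # Fraction of buggy completions identical to the historical bug,
        # comparable in spirit to the reported 44.44% average.
        "memorization_rate": memorized / len(wrong) if wrong else 0.0,
    }

# Toy usage:
toy = [
    CompletionResult("for i in range(1, n + 1):",   # replicates the old bug
                     "for i in range(1, n):",
                     "for i in range(1, n + 1):"),
    CompletionResult("return total / count",        # correct completion
                     "return total / count",
                     "return total / 0"),
]
print(score(toy))  # {'accuracy': 0.5, 'memorization_rate': 1.0}
```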
This research offers valuable insights for AI developers, cybersecurity professionals, and engineers who use these models for code generation, and it signals that existing AI coding assistants call for both caution and further improvement.