Source URL: https://www.theregister.com/2025/03/19/llms_buggy_code/
Source: The Register
Title: Show top LLMs buggy code and they’ll finish off the mistakes rather than fix them
Feedly Summary: One more time, with feeling … Garbage in, garbage out, in training and inference
Researchers have found that large language models (LLMs) tend to parrot buggy code when tasked with completing flawed snippets.…
AI Summary and Description: Yes
Summary: This text presents research findings on the limitations of large language models (LLMs) in coding tasks, notably their tendency to replicate existing bugs when completing code snippets. The study underscores the need to improve LLMs’ grasp of programming syntax and their ability to detect errors.
Detailed Description:
The research, conducted by a group of scientists, analyzed the performance of seven LLMs on bug-prone code snippets, probing how reliably they generate correct code. Key insights from this analysis include:
– **Bug Replication**: The study shows that LLMs often repeat existing bugs rather than correct them (illustrated in the first sketch after this list), a significant limitation for their use in software development.
– **Tested Models**: The seven models were OpenAI’s GPT-4o, GPT-4 (whose figures appear below), and GPT-3.5, Meta’s CodeLlama-13B-hf, Google’s Gemma-7B, BigCode’s StarCoder2-15B, and Salesforce’s CodeGEN-350M. Each achieved varying degrees of success in handling buggy code.
– **Error Rate**: On bug-prone tasks, the models were almost as likely to produce a buggy completion as a correct one, and their accuracy fell well below that on normal code (e.g., GPT-4 had a success rate of 12.27% in buggy contexts versus 29.85% in normal scenarios).
– **Bug Memorization**: The models reproduced historical bugs to a significant degree: on average, 44.44% of the bugs they generated were identical to known past issues. Memorization ranged from 15% to 83% across models, with GPT-4o exhibiting the highest rate (a minimal scoring sketch follows this list).
– **Types of Errors**: The LLMs struggled more with complex coding constructs, such as method invocations and return statements, than with simpler syntax.
– **Recommendations for Improvement**: The researchers suggest that a stronger grasp of programming syntax and semantics, more robust error-detection mechanisms, and tighter integration with development tools could significantly enhance the models’ reliability.
– **Conclusion**: The findings highlight the need for continuous improvement in LLM technology, particularly for software development applications where accuracy is critical.
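To make the bug-replication finding concrete, here is a hypothetical illustration (the snippet is invented for this summary, not taken from the study): given a prefix that already contains an off-by-one bug, a completion model that pattern-matches on the surrounding code will tend to extend the flawed pattern rather than flag or repair it.

```python
# Hypothetical completion prompt (invented example, not from the study).
# The prefix handed to the model already contains an off-by-one bug.

def max_price(prices):
    best = prices[0]
    # BUG in the prefix: "+ 1" lets i run one step past the last valid index.
    for i in range(1, len(prices) + 1):
        # --- the model is asked to complete from here ---
        # A bug-replicating completion reuses the flawed index pattern,
        # so the off-by-one surfaces as an IndexError when i == len(prices):
        if prices[i] > best:
            best = prices[i]
    return best

# A bug-fixing completion would instead tighten the loop bound:
#   for i in range(1, len(prices)):
```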
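The article does not describe the evaluation pipeline in detail, but the two headline numbers (completion accuracy and the share of buggy completions that exactly match a historical bug) can be reproduced with a simple tally. A minimal sketch, assuming each task records the model's completion, the known-correct line, and the historical buggy line; all names and fields here are illustrative, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class CompletionResult:
    """One completion task; the fields are assumptions, not the paper's schema."""
    completion: str       # the line the model produced
    correct_line: str     # the fixed line from the project's history
    historical_bug: str   # the buggy line developers originally wrote

def score(results: list[CompletionResult]) -> dict[str, float]:
    """Tally accuracy, plus the share of wrong completions that exactly
    reproduce the known historical bug (the study's 'memorization')."""
    correct = sum(r.completion.strip() == r.correct_line.strip() for r in results)
    wrong = [r for r in results if r.completion.strip() != r.correct_line.strip()]
    memorized = sum(r.completion.strip() == r.historical_bug.strip() for r in wrong)
    return {
        "accuracy": correct / len(results),
        # Fraction of buggy completions identical to the historical bug,
        # comparable in spirit to the reported 44.44% average.
        "memorization_rate": memorized / len(wrong) if wrong else 0.0,
    }

# Toy usage:
toy = [
    CompletionResult("for i in range(1, n + 1):",   # replicates the old bug
                     "for i in range(1, n):",
                     "for i in range(1, n + 1):"),
    CompletionResult("return total / count",        # correct completion
                     "return total / count",
                     "return total / 0"),
]
print(score(toy))  # {'accuracy': 0.5, 'memorization_rate': 1.0}
```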
This research offers valuable insights for AI developers, cybersecurity professionals, and engineers who use these models for code generation, and it signals that existing AI coding assistants call for both caution and further improvement.