The Register: Tool touted as ‘first AI software engineer’ is bad at its job, testers claim

Jan 23, 2025

—

Source URL: https://www.theregister.com/2025/01/23/ai_developer_devin_poor_reviews/
Source: The Register
Title: Tool touted as ‘first AI software engineer’ is bad at its job, testers claim

Feedly Summary: Nailed just 15% of assigned tasks
A service described as “the first AI software engineer” appears to be rather bad at its job, based on a recent evaluation.…


**Summary:** An evaluation of “Devin,” an AI software engineer created by Cognition AI, reveals significant shortcomings in its performance and reliability. Despite ambitious claims about its capabilities, early testing by data scientists showed that Devin successfully completed only 3 of its 20 assigned tasks (15%), raising concerns about its practical utility and security implications.

**Detailed Description:**
The text focuses on the recent performance evaluation of Devin, touted as the “first AI software engineer” by Cognition AI. It highlights the following key points and implications:

– **Ambitious Claims vs. Reality:**
– Cognition AI marketed Devin as capable of end-to-end app development, bug fixes in codebases, and aiding in team projects.
– The tool operates via Slack commands and runs within a Docker container, integrating with external services like SendGrid for email.

– **Testing and Performance Issues:**
– Early tests by data scientists found that Devin successfully completed only 3 of 20 tasks (15%), raising doubts about its operational effectiveness.
– Tasks that appeared straightforward often resulted in prolonged attempts and dead ends, demonstrating difficulties in both understanding tasks and delivering working solutions.

– **Concerns Over Reliability and Security:**
– Devin’s autonomy, initially seen as a benefit, became a liability: it made excessive attempts at tasks that were impossible or outside its capabilities.
– Comments from developers pointed out critical security issues within the AI’s output, underlining the potential risks when using such autonomous tools in software development.

– **User Experience:**
– The software offered a polished UI and was impressive when it worked, but it was unreliable: users could not predict which tasks would succeed.
– Reliability in software engineering, especially in high-stakes environments, is vital for security and compliance professionals.

**Implications for Security and Compliance Professionals:**
– The assessment of Devin underscores the importance of rigorous testing and analysis of AI tools before deployment in critical environments.
– It highlights potential security risks linked with autonomous systems, including inadequate task completion and the generation of insecure code.
– Professionals in the fields of AI and security should remain vigilant regarding claims made by AI developers and conduct independent assessments to ensure compliance with security standards and frameworks.

Overall, the evaluation of Devin is not only relevant to software developers but also serves as a cautionary tale for security practitioners focusing on the integration of AI into development processes.
