Source URL: https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/
Source: Simon Willison’s Weblog
Title: Aider Polyglot leaderboard results for Claude 3.7 Sonnet
Feedly Summary: Aider Polyglot leaderboard results for Claude 3.7 Sonnet
Paul Gauthier’s Aider Polyglot benchmark is one of my favourite independent benchmarks for LLMs, partly because it focuses on code and partly because Paul is very responsive at evaluating new models.
The brand new Claude 3.7 Sonnet just took the top place, when run with an increased 32,000 thinking token limit.
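For context on what "an increased 32,000 thinking token limit" looks like in practice, here is a minimal sketch of requesting an extended thinking budget from Claude 3.7 Sonnet via the Anthropic Python SDK; the prompt and the max_tokens value are illustrative assumptions, not taken from the benchmark harness:

```python
# Minimal sketch: asking Claude 3.7 Sonnet for a 32,000-token thinking budget
# via the Anthropic Python SDK. Prompt and max_tokens are illustrative only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=40000,  # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 32000},
    messages=[
        {"role": "user", "content": "Solve the next Exercism exercise ..."}
    ],
)

# The response interleaves thinking blocks with the final text answer;
# print only the text blocks here.
for block in response.content:
    if block.type == "text":
        print(block.text)
```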
It’s interesting comparing the benchmark costs – 3.7 Sonnet spent $36.83 running the whole thing, significantly more than the previously leading DeepSeek R1 + Claude 3.5 combo, but a whole lot less than third place o1-high:
| Model | % completed | Total Cost |
|---|---|---|
| claude-3-7-sonnet-20250219 (32k thinking tokens) | 64.9% | $36.83 |
| DeepSeek R1 + claude-3-5-sonnet-20241022 | 64.0% | $13.29 |
| o1-2024-12-17 (high) | 61.7% | $186.50 |
| claude-3-7-sonnet-20250219 (no thinking) | 60.4% | $17.72 |
| o3-mini (high) | 60.4% | $18.16 |
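As a rough way to read those numbers, here is a small back-of-the-envelope script; the dollars-per-percentage-point ratio is an illustrative framing, not an official benchmark metric.

```python
# Back-of-the-envelope comparison: dollars spent per percentage point of
# exercises completed, using the (% completed, total cost) figures above.
results = {
    "claude-3-7-sonnet-20250219 (32k thinking tokens)": (64.9, 36.83),
    "DeepSeek R1 + claude-3-5-sonnet-20241022": (64.0, 13.29),
    "o1-2024-12-17 (high)": (61.7, 186.50),
    "claude-3-7-sonnet-20250219 (no thinking)": (60.4, 17.72),
    "o3-mini (high)": (60.4, 18.16),
}

for model, (pct_completed, total_cost) in results.items():
    print(f"{model}: ${total_cost / pct_completed:.2f} per % completed")
```

On these figures the DeepSeek R1 + Claude 3.5 Sonnet combination remains the cheapest per point of completion, while o1-high is by far the most expensive.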
No results yet for Claude 3.7 Sonnet on the LM Arena leaderboard, which has recently been dominated by Gemini 2.0 and Grok 3.
Via @paulgauthier
Tags: aider, anthropic, claude, evals, generative-ai, ai, llms, paul-gauthier
AI Summary and Description: Yes
Summary: The text reports the latest performance and cost figures for the Claude 3.7 Sonnet model on the Aider Polyglot leaderboard, where it now holds the top position, and compares it with other leading models on completion rate and expense. This is of particular relevance to professionals in AI and LLM security, since model performance and operational cost both factor into decisions about deploying code-generating models.
Detailed Description: The provided content focuses on the recent evaluation of the Claude 3.7 Sonnet model in the context of independent benchmarking of large language models (LLMs). Paul Gauthier’s Aider Polyglot benchmark is recognized for its focus on coding tasks, which are critical for development and operational efficiency in AI, and for how quickly it evaluates newly released models.
Key Points:
– **Model Performance**: The Claude 3.7 Sonnet model, run with a 32,000 thinking token limit, achieved the highest completion rate (64.9%) on the benchmark’s coding exercises, demonstrating its effectiveness for code editing and technical problem-solving.
– **Cost Efficiency**: The benchmarking results reveal wide cost variation between models with similar completion rates, so a higher spend does not automatically buy better results; weighing cost against accuracy is critical in model selection.
– **Comparative Analysis**: Claude 3.7 Sonnet’s $36.83 run cost more than the previously leading DeepSeek R1 + Claude 3.5 Sonnet combination ($13.29) but far less than third-place o1-high ($186.50), while delivering the best completion rate.
– **Market Positioning**: LM Arena leaderboard results for Claude 3.7 Sonnet are not yet available; that leaderboard has recently been dominated by Gemini 2.0 and Grok 3, reflecting how quickly competitive dynamics and benchmark standings can shift and influence model adoption and trust in AI systems.
– **Implications for Security Professionals**: The performance and cost metrics can inform security and compliance architects evaluating models for deployment in secure environments, where performance must be balanced against budget constraints and the potential vulnerabilities associated with code-generation models.
This analysis emphasizes the importance of performance metrics in assessing AI models, particularly amidst growing concerns about security implications tied to generative AI capabilities and deployment in sensitive applications.