Simon Willison’s Weblog: Andrej Karpathy’s initial impressions of Grok 3

Source URL: https://simonwillison.net/2025/Feb/18/andrej-karpathy-grok-3/
Source: Simon Willison’s Weblog
Title: Andrej Karpathy’s initial impressions of Grok 3

Feedly Summary: Andrej Karpathy’s initial impressions of Grok 3
Andrej has the most detailed analysis I’ve seen so far of xAI’s Grok 3 release from last night. He runs through a bunch of interesting test prompts, and concludes:

As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI’s strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented.

I was delighted to see him include my Generate an SVG of a pelican riding a bicycle benchmark in his tests.
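
For context, the benchmark is nothing more than that prompt sent to a model, with the resulting drawing judged by eye. A minimal sketch of running it against an OpenAI-compatible chat endpoint follows; the base_url and model identifier are illustrative assumptions, not values confirmed by the post.

```python
# Minimal sketch: send the pelican benchmark prompt to an
# OpenAI-compatible chat API. The base_url and model name are
# illustrative assumptions, not confirmed by the post.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="grok-3",  # hypothetical model identifier
    messages=[
        {"role": "user",
         "content": "Generate an SVG of a pelican riding a bicycle"},
    ],
)

# The benchmark is judged visually: save the reply and open it in a browser.
# (Real replies may wrap the SVG in markdown fences that need stripping.)
with open("pelican.svg", "w") as f:
    f.write(response.choices[0].message.content)
```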

Tags: andrej-karpathy, llms, ai, generative-ai, pelican-riding-a-bicycle

AI Summary and Description: Yes

Summary: The text covers Andrej Karpathy’s early hands-on analysis of xAI’s Grok 3, which he places near the state of the art among large language models (LLMs). The speed of Grok 3’s development, roughly one year from a standing start, is the standout point, and the piece offers useful signal for professionals in AI and security.

Detailed Description: Karpathy’s analysis raises several significant points about the performance and implications of Grok 3, developed by xAI.

– **Rapid Advancement**: Grok 3 reached roughly state-of-the-art performance about one year after xAI started from scratch, a pace Karpathy describes as unprecedented.
– **Competitive Analysis**: Karpathy places Grok 3 + Thinking around OpenAI’s strongest model (o1-pro) and slightly ahead of DeepSeek-R1 and Gemini 2.0 Flash Thinking, a direct comparison that gives a concrete snapshot of the current competitive landscape.
– **Test Prompts**: Karpathy ran a range of interesting test prompts, including Willison’s “Generate an SVG of a pelican riding a bicycle” challenge, which probes whether a model can turn a plain-language description into structured vector markup.
– **Benchmarking**: Informal, idiosyncratic benchmarks like the pelican SVG usefully complement standard evaluations, partly because models are less likely to have been optimized for them; a simple way to machine-check such output is sketched after this list.
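
Since the benchmark’s output is markup rather than a rendered image, one cheap automated check before eyeballing the drawing is whether the reply is even well-formed SVG. A minimal sketch using only the Python standard library (the helper name is hypothetical):

```python
# Minimal sketch: verify that a model's SVG reply parses as XML with an
# <svg> root. This checks syntax only, not whether the drawing actually
# resembles a pelican riding a bicycle.
import xml.etree.ElementTree as ET

def is_well_formed_svg(svg_text: str) -> bool:  # hypothetical helper
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return False
    # Namespaced SVG roots parse as "{http://www.w3.org/2000/svg}svg".
    return root.tag.endswith("svg")

print(is_well_formed_svg('<svg xmlns="http://www.w3.org/2000/svg"/>'))  # True
print(is_well_formed_svg("not markup"))                                 # False
```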

For AI professionals working with LLMs, this commentary underscores the value of continuous benchmarking and of tracking how quickly the frontier moves, particularly where model capabilities feed into risk assessments and compliance decisions.