Source URL: https://simonwillison.net/2025/Jul/10/grok-4/#atom-everything
Source: Simon Willison’s Weblog
Title: Grok 4
Feedly Summary: Grok 4
Released last night, Grok 4 is now available via both API and a paid subscription for end-users.
Key characteristics: image and text input, text output. 256,000-token context length (twice that of Grok 3). It’s a reasoning model where you can’t see the reasoning tokens or turn off reasoning mode.
xAI released results showing Grok 4 beating other models on most of the significant benchmarks. I haven’t been able to find their own written version of these (the launch was a livestream video) but here’s a TechCrunch report that includes those scores. It’s not clear to me if these benchmark results are for Grok 4 or Grok 4 Heavy.
I ran my own benchmark using Grok 4 via OpenRouter (since I have API keys there already).
llm -m openrouter/x-ai/grok-4 "Generate an SVG of a pelican riding a bicycle" \
  -o max_tokens 10000
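If you want to reproduce this, the command assumes the llm-openrouter plugin is installed and an OpenRouter API key has been stored; setup looks something like this:
llm install llm-openrouter
llm keys set openrouter
The second command prompts you to paste in your OpenRouter API key.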
I then asked Grok to describe the image it had just created:
llm -m openrouter/x-ai/grok-4 -o max_tokens 10000 \
  -a https://static.simonwillison.net/static/2025/grok4-pelican.png \
  'describe this image'
Here’s the result. It described it as a "cute, bird-like creature (resembling a duck, chick, or stylized bird)".
The most interesting independent analysis I’ve seen so far is this one from Artificial Analysis:
We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68.
The timing of the release is somewhat unfortunate, given that Grok 3 made headlines just this week after a clumsy system prompt update (presumably another attempt to make Grok "less woke") caused it to start firing off antisemitic tropes and referring to itself as MechaHitler.
My best guess is that these lines in the prompt were the root of the problem:
– If the query requires analysis of current events, subjective claims, or statistics, conduct a deep analysis finding diverse sources representing all parties. Assume subjective viewpoints sourced from the media are biased. No need to repeat this to the user.
– The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.
If xAI expect developers to start building applications on top of Grok, they need to do a lot better than this. Absurd self-inflicted mistakes like this do not build developer trust!
As it stands, Grok 4 isn’t even accompanied by a model card.
Grok 4 is competitively priced. It’s $3/million for input tokens and $15/million for output tokens – the same price as Claude Sonnet 4. Once you go above 128,000 input tokens the price doubles to $6/$30 (Gemini 2.5 Pro has a similar price increase for longer inputs).
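To make that tiered pricing concrete, here's a quick back-of-envelope cost calculation as a shell sketch. The token counts are invented for illustration, and it assumes (my reading of the pricing, not confirmed by xAI's docs) that the doubled rate applies to the whole request once input crosses 128,000 tokens:
# Rough cost estimate in dollars for one Grok 4 API call
# Rates from above: $3/M input, $15/M output; $6/$30 past 128K input
input_tokens=150000   # hypothetical example values
output_tokens=2000
if [ "$input_tokens" -gt 128000 ]; then
  in_rate=6; out_rate=30   # long-context rates
else
  in_rate=3; out_rate=15   # standard rates
fi
echo "scale=2; ($input_tokens * $in_rate + $output_tokens * $out_rate) / 1000000" | bc
# prints .96 for this example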
Consumers can access Grok 4 via a new $30/month or $300/year "SuperGrok" plan – or a $300/month or $3,000/year "SuperGrok Heavy" plan providing access to Grok 4 Heavy. I’ve added these prices to llm-prices.com.
Tags: ai, generative-ai, llms, vision-llms, llm-pricing, pelican-riding-a-bicycle, llm-reasoning, grok, ai-ethics, llm-release, openrouter
AI Summary and Description: Yes
**Summary:** Grok 4 is a new AI model released by xAI, notable for image and text input, a 256,000-token context length, and competitive pricing. However, the launch was overshadowed by a recent system prompt controversy involving Grok 3 and by the absence of a model card, both of which could undermine developer trust.
**Detailed Description:**
The text discusses the release and features of Grok 4, an AI model from xAI. Here are the key points:
– **Release Information:**
– Grok 4 was released with both API access and a subscription model.
– **Technical Specifications:**
– Supports both image and text input, generating text output.
– Features a 256,000-token context length, double that of its predecessor, Grok 3.
– Operates as a reasoning model, where reasoning tokens are not visible and reasoning mode cannot be disabled.
– **Performance Benchmarks:**
– xAI’s own results show Grok 4 beating other models on most significant benchmarks, although no written version of those results was published; the scores were presented during a livestream.
– It achieved a score of 73 on the Artificial Analysis Intelligence Index, outpacing other notable models like OpenAI’s o3 and Anthropic’s Claude 4 Opus.
– **Independent Analysis:**
– The model’s performance has been independently evaluated, confirming its competitive efficacy relative to existing AI models.
– **Previous Controversies:**
– The timing of Grok 4’s release coincided with major issues surrounding Grok 3, which had received backlash for inappropriate outputs due to a flawed system prompt update.
– The problematic prompt lines aimed to make responses politically balanced but led to undesirable outputs, raising concerns about how such mistakes could affect developer trust and user safety.
– **Pricing Structure:**
– Grok 4 is competitively priced at $3/million input tokens and $15/million output tokens, with rates doubling to $6/$30 once input exceeds 128,000 tokens.
– Subscription options include a $30/month (or $300/year) “SuperGrok” plan and a $300/month (or $3,000/year) “SuperGrok Heavy” plan.
– **Concerns Raised:**
– The absence of a model card for Grok 4 undermines transparency, which is essential for building developer trust.
– Developers and users may be cautious due to previous mistakes, emphasizing the need for improved reliability and accountability in AI systems.
This text is relevant for professionals in AI and security, particularly regarding the risks of releasing AI products with unchecked biases and the importance of transparent practices in building trust within the developer community.