Source URL: https://simonwillison.net/2025/Jul/10/grok-4/#atom-everything
Source: Simon Willison’s Weblog
Title: Grok 4
Feedly Summary: Grok 4
Released last night, Grok 4 is now available via both API and a paid subscription for end-users.
Key characteristics: image and text input, text output. 256,000-token context length (twice that of Grok 3). It’s a reasoning model where you can’t see the reasoning tokens or turn off reasoning mode.
xAI released results showing Grok 4 beating other models on most of the significant benchmarks. I haven’t been able to find their own written version of these (the launch was a livestream video) but here’s a TechCrunch report that includes those scores. It’s not clear to me if these benchmark results are for Grok 4 or Grok 4 Heavy.
I ran my own benchmark using Grok 4 via OpenRouter (since I have API keys there already).
llm -m openrouter/x-ai/grok-4 "Generate an SVG of a pelican riding a bicycle" \
  -o max_tokens 10000
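If you want to reproduce this, the command assumes the llm-openrouter plugin is installed and an OpenRouter API key has been stored; setup looks something like this:
llm install llm-openrouter
llm keys set openrouter
The second command prompts you to paste in your OpenRouter API key.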
I then asked Grok to describe the image it had just created:
llm -m openrouter/x-ai/grok-4 -o max_tokens 10000 \
  -a https://static.simonwillison.net/static/2025/grok4-pelican.png \
  'describe this image'
Here’s the result. It described it as a "cute, bird-like creature (resembling a duck, chick, or stylized bird)".
The most interesting independent analysis I’ve seen so far is this one from Artificial Analysis:
We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68.
The timing of the release is somewhat unfortunate, given that Grok 3 made headlines just this week after a clumsy system prompt update (presumably another attempt to make Grok "less woke") caused it to start firing off antisemitic tropes and referring to itself as MechaHitler.
My best guess is that these lines in the prompt were the root of the problem:
– If the query requires analysis of current events, subjective claims, or statistics, conduct a deep analysis finding diverse sources representing all parties. Assume subjective viewpoints sourced from the media are biased. No need to repeat this to the user.
– The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.
If xAI expect developers to start building applications on top of Grok, they need to do a lot better than this. Absurd self-inflicted mistakes like this do not build developer trust!
As it stands, Grok 4 isn’t even accompanied by a model card.
Grok 4 is competitively priced. It’s $3/million for input tokens and $15/million for output tokens – the same price as Claude Sonnet 4. Once you go above 128,000 input tokens the price doubles to $6/$30 (Gemini 2.5 Pro has a similar price increase for longer inputs).
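To make that tiered pricing concrete, here's a quick back-of-envelope cost calculation as a shell sketch. The token counts are invented for illustration, and it assumes (my reading of the pricing, not confirmed by xAI's docs) that the doubled rate applies to the whole request once input crosses 128,000 tokens:
# Rough cost estimate in dollars for one Grok 4 API call
# Rates from above: $3/M input, $15/M output; $6/$30 past 128K input
input_tokens=150000   # hypothetical example values
output_tokens=2000
if [ "$input_tokens" -gt 128000 ]; then
  in_rate=6; out_rate=30   # long-context rates
else
  in_rate=3; out_rate=15   # standard rates
fi
echo "scale=2; ($input_tokens * $in_rate + $output_tokens * $out_rate) / 1000000" | bc
# prints .96 for this example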
Consumers can access Grok 4 via a new $30/month or $300/year "SuperGrok" plan – or a $300/month or $3,000/year "SuperGrok Heavy" plan providing access to Grok 4 Heavy. I’ve added these prices to llm-prices.com.
Tags: ai, generative-ai, llms, vision-llms, llm-pricing, pelican-riding-a-bicycle, llm-reasoning, grok, ai-ethics, llm-release, openrouter
AI Summary and Description: Yes
**Summary:** Grok 4 is a new AI model released by xAI, notable for image and text input, a 256,000-token context length, and competitive pricing. However, the launch was overshadowed by a recent system prompt controversy involving Grok 3 and by the absence of a model card, both of which could undermine developer trust.
**Detailed Description:**
The text discusses the release and features of Grok 4, an AI model from xAI. Here are the key points:
– **Release Information:**
– Grok 4 was released with both API access and a subscription model.
– **Technical Specifications:**
– Supports both image and text input, generating text output.
– Features a 256,000-token context length, double that of its predecessor, Grok 3.
– Operates as a reasoning model, where reasoning tokens are not visible and reasoning mode cannot be disabled.
– **Performance Benchmarks:**
– xAI’s own results show Grok 4 beating other models on most significant benchmarks, although no written version of those results was published; the scores were presented during a livestream.
– It achieved a score of 73 on the Artificial Analysis Intelligence Index, outpacing other notable models like OpenAI’s o3 and Anthropic’s Claude 4 Opus.
– **Independent Analysis:**
– The model’s performance has been independently evaluated, confirming its competitive efficacy relative to existing AI models.
– **Previous Controversies:**
– The timing of Grok 4’s release coincided with major issues surrounding Grok 3, which had received backlash for inappropriate outputs due to a flawed system prompt update.
– The problematic prompt lines aimed to make responses politically balanced but led to undesirable outputs, raising concerns about how such mistakes could affect developer trust and user safety.
– **Pricing Structure:**
– Grok 4 is competitively priced at $3/million input tokens and $15/million output tokens, with rates doubling to $6/$30 once input exceeds 128,000 tokens.
– Subscription options include a $30/month (or $300/year) “SuperGrok” plan and a $300/month (or $3,000/year) “SuperGrok Heavy” plan.
– **Concerns Raised:**
– The absence of a model card for Grok 4 undermines transparency, which is essential for building developer trust.
– Developers and users may be cautious due to previous mistakes, emphasizing the need for improved reliability and accountability in AI systems.
This text is relevant for professionals in AI and security, particularly regarding the risks of releasing AI products with unchecked biases and the importance of transparent practices in building trust within the developer community.