Source URL: https://simonwillison.net/2025/Aug/1/faster-inference/
Source: Simon Willison’s Weblog
Title: Faster inference
Feedly Summary: Two interesting examples of inference speed as a flagship feature of LLM services today.
First, Cerebras announced two new monthly plans for their extremely high speed hosted model service: Cerebras Code Pro ($50/month, 1,000 messages a day) and Cerebras Code Max ($200/month, 5,000/day). The model they are selling here is Qwen’s Qwen3-Coder-480B-A35B-Instruct, likely the best available open weights coding model right now and one that was released just ten days ago. Ten days from model release to third-party subscription service feels like some kind of record.
Cerebras claim they can serve the model at an astonishing 2,000 tokens per second – four times the speed of Claude Sonnet 4 in their demo video.
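For a rough sense of what those numbers mean in wall-clock terms, here's a back-of-the-envelope calculation. Note the 500 tok/s figure for Claude Sonnet 4 isn't quoted anywhere above; it's simply implied by Cerebras's "four times the speed" comparison:

```python
# Wall-clock time to stream a completion at the claimed throughputs.
# 500 tok/s for Claude Sonnet 4 is implied by Cerebras's "four times
# the speed" claim, not an independently measured number.
for name, tok_per_sec in [("Cerebras (claimed)", 2000), ("Claude Sonnet 4 (implied)", 500)]:
    for n_tokens in (1_000, 10_000):
        print(f"{name}: {n_tokens:,} tokens in {n_tokens / tok_per_sec:.1f}s")
```

At those rates a 10,000-token response streams in 5 seconds instead of 20, which is the difference between waiting and not noticing the wait.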
Also today, Moonshot announced a new hosted version of their trillion parameter Kimi K2 model called kimi-k2-turbo-preview:
🆕 Say hello to kimi-k2-turbo-preview
Same model. Same context. NOW 4× FASTER.
⚡️ From 10 tok/s to 40 tok/s.
💰 Limited-Time Launch Price (50% off until Sept 1)
$0.30 / million input tokens (cache hit)
$1.20 / million input tokens (cache miss)
$5.00 / million output tokens
👉 Explore more: platform.moonshot.ai
This is twice the price of their regular model for 4x the speed (increasing to 4x the price in September). No details yet on how they achieved the speed-up.
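As a sketch of what that pricing works out to in practice: the regular-model prices below are inferred as half the turbo launch prices (per the "twice the price" comparison), and the September prices as double them; neither set is quoted directly by Moonshot.

```python
# USD cost of a hypothetical request: 100k input tokens (all cache
# misses) plus 5k output tokens, under each inferred price point.
def cost(input_per_m, output_per_m, input_tokens=100_000, output_tokens=5_000):
    return input_per_m * input_tokens / 1e6 + output_per_m * output_tokens / 1e6

scenarios = {
    "regular kimi-k2 (inferred)": (0.60, 2.50),
    "kimi-k2-turbo, launch price": (1.20, 5.00),
    "kimi-k2-turbo, from September (inferred)": (2.40, 10.00),
}
for name, (inp, out) in scenarios.items():
    print(f"{name}: ${cost(inp, out):.3f}")
```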
I am interested to see how much market demand there is for faster performance like this. I’ve experimented with Cerebras in the past and found that the speed really does make iterating on code with live previews feel a whole lot more interactive.
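If you want to sanity-check throughput claims yourself, here's a minimal sketch using the openai Python client against an OpenAI-compatible endpoint. The base URL, model id, and environment variable are illustrative assumptions, so check the provider's docs rather than trusting them:

```python
import os
import time

from openai import OpenAI

# Point the standard OpenAI client at a compatible endpoint
# (base URL and model id are assumptions; consult provider docs).
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

start = time.monotonic()
chunks = 0
stream = client.chat.completions.create(
    model="qwen-3-coder-480b",  # illustrative model id
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # streamed chunks: a rough proxy for tokens
elapsed = time.monotonic() - start
print(f"~{chunks / elapsed:.0f} chunks/sec over {elapsed:.1f}s")
```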
Tags: generative-ai, cerebras, llm-pricing, ai, ai-in-china, llms, qwen
AI Summary and Description: Yes
Summary: The text discusses recent advancements in the inference speed of large language model (LLM) services offered by Cerebras and Moonshot. It highlights their pricing and throughput claims, which together demonstrate a significant shift: inference speed is now being marketed as a flagship feature of hosted coding models.
Detailed Description:
The provided text delves into two noteworthy announcements from LLM service providers, focusing on their advancements in inference speed—a critical aspect for developers working with AI and coding applications.
– **Cerebras’ Monthly Plans**:
– Launches two subscription plans: Cerebras Code Pro ($50/month, 1,000 messages/day) and Cerebras Code Max ($200/month, 5,000 messages/day).
– Specifically utilizes Qwen’s Qwen3-Coder-480B-A35B-Instruct model, recognized as one of the most advanced open-weight coding models available, released just ten days prior.
– Claims a serving speed of 2,000 tokens per second, four times the speed of Claude Sonnet 4 in Cerebras's demo video.
– **Moonshot’s New Offering**:
– Introduces the kimi-k2-turbo-preview, a faster version of their existing trillion-parameter Kimi K2 model.
– Throughput increases from 10 tokens/second to 40 tokens/second.
– Launch pricing is set at $0.30/million input tokens (cache hit), $1.20/million input tokens (cache miss), and $5.00/million output tokens. This represents a 50% discount that runs until September 1, after which prices double.
– **Market Implications**:
– The text suggests that inference speed is becoming a differentiating, separately priced feature in the LLM landscape, with an open question about how much market demand exists for it.
– The author expresses personal interest in evaluating the market’s response to these new performance metrics.
– The author's previous experiments with Cerebras found that higher speed makes iterating on code with live previews feel substantially more interactive.
Overall, these developments signify a notable shift in the LLM landscape: hosted providers are now competing on tokens per second as well as model quality, and are charging a premium for speed.