Source URL: https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-everything
Source: Simon Willison’s Weblog
Title: Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!
Feedly Summary: Qwen3-Next-80B-A3B
Qwen announced two new models via their Twitter account (nothing on their blog yet): Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking.
They make some big claims on performance:
Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.
Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.
The name “80B-A3B” indicates 80 billion parameters, of which only 3 billion are active at a time. You still need enough GPU-accessible RAM to hold all 80 billion in memory at once, but only 3 billion are used for each generated token, which provides a significant speedup in responding to prompts.
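As a rough back-of-envelope (my arithmetic, not from the announcement), here is what that total-versus-active split means for memory and per-token weight reads at a few common precisions:

```python
# Back-of-envelope memory math for an 80B-total / 3B-active MoE model.
# Rough estimates only: real deployments add overhead for the KV cache,
# activations, and framework buffers.

TOTAL_PARAMS = 80e9   # every expert must sit in GPU-accessible RAM
ACTIVE_PARAMS = 3e9   # parameters actually used per generated token

BYTES_PER_PARAM = {"fp16/bf16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = TOTAL_PARAMS * nbytes / 1e9
    active_gb = ACTIVE_PARAMS * nbytes / 1e9
    print(f"{precision:>9}: ~{weights_gb:.0f} GB to hold the weights, "
          f"~{active_gb:.1f} GB of weights read per token")
```

At fp16/bf16 that works out to roughly 160 GB of weights, consistent with the ~150 GB Hugging Face downloads mentioned below.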
More details from their tweet:
80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B. (esp. @ 32K+ context!)
Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed & recall
Ultra-sparse MoE: 512 experts, 10 routed + 1 shared
Multi-Token Prediction → turbo-charged speculative decoding
Beats Qwen3-32B in perf, rivals Qwen3-235B in reasoning & long-context
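The “ultra-sparse MoE” line describes a router that, for each token, picks 10 of the 512 experts and always adds 1 shared expert. Here's a minimal, illustrative sketch of that routing pattern (toy dimensions and simplified gating; not Qwen's actual implementation):

```python
import numpy as np

NUM_EXPERTS = 512   # experts, per the tweet
TOP_K = 10          # routed experts selected per token
D_MODEL = 64        # toy hidden size, purely for illustration

rng = np.random.default_rng(0)
router_weights = rng.normal(size=(D_MODEL, NUM_EXPERTS))

def route(token_hidden):
    """Pick TOP_K experts for one token; softmax-normalize their gates."""
    logits = token_hidden @ router_weights        # shape: (NUM_EXPERTS,)
    top_idx = np.argsort(logits)[-TOP_K:]         # the 10 winning experts
    gates = np.exp(logits[top_idx] - logits[top_idx].max())
    gates /= gates.sum()
    return top_idx, gates

hidden = rng.normal(size=D_MODEL)
experts, gates = route(hidden)
# The layer output would be shared_expert(hidden) plus the gate-weighted
# sum of the selected expert MLPs: only the 10 routed experts plus the
# shared one run per token, which is where the "3B active" figure comes from.
print(experts, gates.round(3))
```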
The models on Hugging Face are around 150GB each so I decided to try them out via OpenRouter instead (Thinking, Instruct).
I used my llm-openrouter plugin, which I installed like this:
llm install llm-openrouter
llm keys set openrouter
# paste key here
Then found the model IDs with this command:
llm models -q next
Which output:
OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-thinking
OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-instruct
I have an LLM prompt template saved called pelican-svg which I created like this:
llm "Generate an SVG of a pelican riding a bicycle" –save pelican-svg
This means I can run my pelican benchmark like this:
llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-thinking
Or like this:
llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-instruct
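The llm tool also has a Python API, so the same benchmark can be scripted. A minimal sketch, assuming the llm-openrouter plugin and key are configured as above:

```python
import llm

# Assumes `llm install llm-openrouter` and `llm keys set openrouter`
# have already been run, as shown earlier.
model = llm.get_model("openrouter/qwen/qwen3-next-80b-a3b-instruct")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())
```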
Here’s the thinking model output (exported with llm logs -c | pbcopy after I ran the prompt):
I liked the "Whimsical style with smooth curves and friendly proportions (no anatomical accuracy needed for bicycle riding!)" note in the transcript.
The instruct (non-reasoning) model gave me this:
"Who needs legs!?" indeed! I like the penguin-flamingo emoji sequence it’s decided on for pelicans.
Tags: ai, generative-ai, llms, llm, qwen, pelican-riding-a-bicycle, llm-reasoning, llm-release, openrouter, ai-in-china
AI Summary and Description: Yes
Summary: The text provides an overview of Qwen’s new AI models, Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking, highlighting their performance claims, architectural details, and use case examples. It is particularly relevant for professionals in AI security and generative AI security due to its implications for model optimization and inference speed.
Detailed Description:
The text describes the launch of two new AI models by Qwen, elaborating on their capabilities, performance metrics, and architectural innovations. Below are the key points discussed in the text:
– **Model Specifications**:
– The Qwen3-Next-80B-A3B model features 80 billion parameters, but only 3 billion are active during inference, allowing for quicker processing and reduced computational costs.
– This architecture is designed for significant speed increases in response times when handling prompts, particularly with long contexts.
– **Performance Claims**:
– The Instruct model claims performance nearly equal to Qwen’s previous flagship model with 235 billion parameters.
– The Thinking model is positioned as outperforming Google’s Gemini-2.5-Flash-Thinking in reasoning capabilities.
– **Technological Innovations**:
– **Hybrid Architecture**: Utilizes Gated DeltaNet and Gated Attention techniques to optimize both speed and recall.
– **Ultra-sparse Mixture of Experts (MoE)**: Features 512 experts; for each token, 10 routed experts are activated alongside 1 always-on shared expert, enhancing processing efficiency.
– **Multi-Token Prediction**: Enables more efficient speculative decoding, which further improves response speed (see the sketch after this list).
– **Practical Implementation**:
– Users can access the models via OpenRouter, and the text describes prompting techniques using installed plugins to evaluate model performance through a creative prompt related to generating SVG images.
– **Use Case**:
– The benchmark example involving generating a whimsical image of a pelican riding a bicycle showcases the models’ capabilities in producing detailed and creative outputs with specific aesthetic requests.
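To make the multi-token prediction point concrete: it feeds speculative decoding, where a cheap draft proposes several tokens ahead and the expensive model verifies them. Below is an illustrative sketch of the accept/reject loop with toy stand-in models (greedy verification; real systems check all drafts in one batched forward pass, and this is not Qwen's implementation):

```python
import random

def draft_model(context, k):
    # Toy stand-in for a cheap drafter proposing k next tokens.
    random.seed(len(context))
    return [random.randint(0, 9) for _ in range(k)]

def target_model(context):
    # Toy stand-in for the expensive model's single best next token.
    random.seed(len(context) * 7)
    return random.randint(0, 9)

def speculative_step(context, k=4):
    """One round of speculative decoding with greedy verification."""
    accepted = []
    for token in draft_model(context, k):
        if target_model(context + accepted) == token:
            accepted.append(token)              # draft matched: keep it
        else:
            accepted.append(target_model(context + accepted))
            break                               # first mismatch ends the round
    return accepted  # 1..k tokens emitted per expensive verification round

print(speculative_step([1, 2, 3]))
```

The payoff: whenever draft tokens are accepted, the target model emits several tokens for the cost of a single verification round.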
By understanding the performance capabilities and architectural advantages of these new models, security and compliance professionals can anticipate how such advances in generative AI may affect AI security, model training methodologies, and the ethical considerations around deploying models in production.