Source URL: https://simonwillison.net/2025/Aug/10/qwen3-4b/#atom-everything
Source: Simon Willison’s Weblog
Title: Qwen3-4B-Thinking: "This is art – pelicans don’t ride bikes!"
Feedly Summary: I’ve fallen a few days behind keeping up with Qwen. They released two new 4B models last week: Qwen3-4B-Instruct-2507 and its thinking equivalent Qwen3-4B-Thinking-2507.
These are relatively tiny models that punch way above their weight. I’ve been running the 8bit GGUF varieties via LM Studio (here’s Instruct, here’s Thinking) – both of them are 4GB downloads that use around 4.3GB of my M2 MacBook Pro’s system RAM while running. Both are way more capable than I would expect from such small files.
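If you want to try the same setup, LM Studio exposes an OpenAI-compatible HTTP API once its local server is running, on port 1234 by default. Here's a minimal sketch; the model identifier below is a placeholder, so substitute whatever ID LM Studio actually reports for your download:

```bash
# List the model IDs LM Studio is currently serving:
curl -s http://localhost:1234/v1/models | jq -r '.data[].id'

# Send a chat completion to one of them (model ID is a placeholder;
# substitute whatever the call above reports):
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b-thinking-2507",
    "messages": [{"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}]
  }' | jq -r '.choices[0].message.content'
```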
Qwen3-4B-Thinking is the first model I’ve tried which called out the absurdity of being asked to draw a pelican riding a bicycle!
Here’s the full transcript – some highlights from the thinking trace:
Okay, the user asked for an SVG of a pelican riding a bicycle. That’s… oddly specific and completely unrealistic.
First reaction: Pelicans don’t ride bicycles. They’re birds that fly or swim, not human-like vehicles. This feels like a mix-up – maybe they meant “a cyclist” but typed “pelican”? Or perhaps it’s a joke?
Hmm, user might be:
A kid playing with words (pelican + bicycle = silly combo)
Someone testing if I’ll generate impossible images
Actually wanting a pelican on a bike for art project? But that still doesn’t make sense biologically
Checks SVG capabilities
SVG is for vector graphics, but this request violates physics and biology. If I tried to draw it literally:
Pelican would be too big for a bike frame
Bike wheels would be tiny compared to bird wings
No realistic way to show "riding" since pelicans don’t have legs for pedaling
[…]
Self-check: Am I being too pedantic? Nah – if someone asks for impossible things, it’s better to gently correct than make fake art that could confuse them. […]
I appreciate your creative request! 🦅🚴♂️ However, pelicans don’t ride bicycles (they’re birds that fly/swim, not humans with bike skills). Generating this as a literal SVG would be biologically impossible and visually confusing.
"This request violates physics and biology" – Qwen3-4B-Thinking saying what no other model ever thought to express.
I also really enjoyed the "Am I being too pedantic? Nah" line. This model is sassy!
It continued on to spit out two SVGs: the first titled "Most Likely: A Human Cyclist" and the second "Artistic Interpretation: Pelican On a Bicycle (For Fun!)".
They were both terrible. Here’s the human one:
And here’s the supposed pelican:
I like Qwen’s decision to include the clarifying annotation "This is art – pelicans don’t ride bikes!":
I tried the Qwen3-4B-Instruct non-thinking model too. It answered much faster (no time spent questioning my choice of task with its thinking tokens) and gave me this:
4B is such an interesting model size. These models should run on almost anything and, at least on my M2 MacBook Pro, they run fast: I’m getting 50+ tokens per second while they use just under 4.5GB of RAM.
The question is always how useful such a tiny model can be. Clearly it’s not great for SVG pelican illustrations!
These models have a 262,144 token context – wildly impressive, if it works.
So I tried another experiment: I used the Instruct model to summarize this Hacker News conversation about GPT-5.
I did this with the llm-lmstudio plugin for LLM combined with my hn-summary.sh script, which meant I could run the experiment like this:
hn-summary.sh 44851557 -m qwen3-4b-instruct-2507
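The llm-lmstudio plugin (installed with `llm install llm-lmstudio`, following LLM's usual plugin pattern) registers whatever models LM Studio is serving as llm model IDs. The script itself is a thin wrapper around the Algolia Hacker News API and the llm CLI – here's a sketch of the same idea rather than the exact script:

```bash
#!/bin/bash
# hn-summary.sh <item-id> [extra llm options, e.g. -m model-id]
# Fetches a Hacker News thread from the Algolia API, flattens all
# comment text, and pipes it to the llm CLI for summarization.
item_id=$1
shift
curl -s "https://hn.algolia.com/api/v1/items/$item_id" \
  | jq -r '[recurse(.children[]?) | .text // empty] | join("\n\n")' \
  | llm "$@" -s 'Summarize the themes of the opinions expressed here.'
```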
I believe this is 15,785 tokens – so nothing close to the 262,144 maximum but still an interesting test of a 4GB local model.
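One way to sanity-check a count like that is my ttok tool (`pip install ttok`). It defaults to OpenAI's tokenizers rather than Qwen's, so treat the number as an approximation:

```bash
# Fetch the same thread, flatten the comments, and count tokens.
# ttok uses an OpenAI tokenizer by default, so this only approximates
# what Qwen's own tokenizer would report.
curl -s "https://hn.algolia.com/api/v1/items/44851557" \
  | jq -r '[recurse(.children[]?) | .text // empty] | join("\n\n")' \
  | ttok
```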
The good news is Qwen spat out a genuinely useful summary of the conversation! You can read that here – it’s the best I’ve seen yet from a model running on my laptop, though honestly I’ve not tried many other recent models in this way.
The bad news… it took almost five minutes to process and return the result!
As a loose calculation: if the model can output 50 tokens/second, maybe there’s a similar speed for processing incoming input, in which case 15,785 / 50 ≈ 315 seconds, which is 5m15s.
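Or, spelled out as a one-liner:

```bash
# Prompt tokens divided by assumed prefill speed (tokens/second)
echo "$((15785 / 50)) seconds"   # 315 seconds, i.e. about 5m15s
```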
Hosted models can crunch through 15,000 tokens of input in just a few seconds. I guess this is one of the more material limitations of running models on Apple silicon as opposed to dedicated GPUs.
I think I’m going to spend some more time with these models. They’re fun, they have personality and I’m confident there are classes of useful problems they will prove capable at despite their small size. Their ability at summarization should make them a good fit for local RAG, and I’ve not started exploring their tool calling abilities yet.
Tags: ai, generative-ai, local-llms, llms, qwen, pelican-riding-a-bicycle, llm-release, lm-studio
AI Summary and Description: Yes
**Summary:** The text covers the release of two 4-billion-parameter models, Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507. Despite their small size, the models handle creative reasoning and long-context summarization tasks surprisingly well, which is useful context for professionals evaluating small local models for AI development and deployment.
**Detailed Description:**
– **Model Overview:**
  – Two new models have been released: Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507.
  – Both are lightweight: roughly 4GB downloads that use approximately 4.3GB of RAM on a machine like an M2 MacBook Pro.
– **Capabilities:**
  – The Qwen3-4B-Thinking model is the first the author tried that called out the absurdity of the request itself, a notable display of reasoning for a model this small.
  – It provided humorous, self-aware commentary on a request for an SVG of a pelican riding a bicycle, acknowledging the biological and physical impossibility of the scene.
– **Performance:**
  – The Instruct model responds quickly, while the Thinking model spends time on deliberative reasoning tokens, producing more considered output at the cost of longer response times.
  – The Qwen3-4B-Instruct model produced a genuinely useful summary of a roughly 15,785-token Hacker News conversation, though prompt processing took almost five minutes – a real limitation compared to cloud-hosted models.
– **Insights Gained:**
  – The post reflects on the balance between model size and capability, noting that small models can still handle specific tasks effectively.
  – Their summarization ability suggests a good fit for local retrieval-augmented generation (RAG), relevant to security and privacy professionals who want to keep data on-device.
– **Potential Applications:**
  – These models give R&D teams an inexpensive way to explore local summarization and, eventually, tool calling, contributing to more robust on-device generative AI solutions in various sectors, including security.
The content is significant for AI and generative-AI security because it documents the evolving capabilities of small local models through concrete experiments, helping industry professionals gauge what these models can realistically be trusted to do in their fields.