Source URL: https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/
Source: Simon Willison’s Weblog
Title: Trying out Qwen3 Coder Flash using LM Studio and Open WebUI and LLM
Feedly Summary: Qwen just released their sixth model(!) for this July called Qwen3-Coder-30B-A3B-Instruct – listed as Qwen3-Coder-Flash in their chat.qwen.ai interface.
It’s 30.5B total parameters with 3.3B active at any one time. This means it will fit on a 64GB Mac – and even a 32GB Mac if you quantize it – and can run really fast thanks to that smaller set of active parameters.
It’s a non-thinking model that is specially trained for coding tasks.
This is an exciting combination of properties: optimized for coding performance and speed and small enough to run on a mid-tier developer laptop.
Trying it out with LM Studio and Open WebUI
I like running models like this using Apple’s MLX framework. I ran GLM-4.5 Air the other day using the mlx-lm Python library directly, but this time I decided to try out the combination of LM Studio and Open WebUI.
(LM Studio has a decent interface built in, but I like the Open WebUI one slightly more.)
I installed the model by clicking the “Use model in LM Studio” button on LM Studio’s qwen/qwen3-coder-30b page. It gave me a bunch of options:
I chose the 6bit MLX model, which is a 24.82GB download. Other options include 4bit (17.19GB) and 8bit (32.46GB). The download sizes are roughly the same as the amount of RAM required to run the model – picking that 24GB one leaves 40GB free on my 64GB machine for other applications.
Then I opened the developer settings in LM Studio (the green folder icon) and turned on "Enable CORS" so I could access it from a separate Open WebUI instance.
Now I switched over to Open WebUI. I installed and ran it using uv like this:
uvx --python 3.11 open-webui serve
Then I navigated to http://localhost:8080/ to access the interface. I opened its settings and configured a new "Connection" to LM Studio:
That needs a base URL of http://localhost:1234/v1 and a key of anything you like. I also set the optional prefix to lm just in case my Ollama installation – which Open WebUI detects automatically – ended up with any duplicate model names.
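LM Studio’s local server speaks the OpenAI-compatible API, so you can also sanity-check the connection outside Open WebUI. Here’s a minimal sketch using the openai Python package (my own addition, not something the setup above requires); the qwen/qwen3-coder-30b model id matches the LM Studio listing, but yours may differ:
from openai import OpenAI

# LM Studio serves an OpenAI-compatible API on port 1234 by default;
# the API key can be any placeholder string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# List the model ids the server is currently exposing
for model in client.models.list():
    print(model.id)

# Run a quick prompt against the Qwen3 Coder model
response = client.chat.completions.create(
    model="qwen/qwen3-coder-30b",
    messages=[{"role": "user", "content": "Say hello in one short sentence"}],
)
print(response.choices[0].message.content)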
Having done all of that, I could select any of my LM Studio models in the Open WebUI interface and start running prompts.
A neat feature of Open WebUI is that it includes an automatic preview panel, which kicks in for fenced code blocks that include SVG or HTML:
Here’s the exported transcript for "Generate an SVG of a pelican riding a bicycle". It ran at almost 60 tokens a second!
Implementing Space Invaders
I tried my other recent simple benchmark prompt as well:
Write an HTML and JavaScript page implementing space invaders
I like this one because it’s a very short prompt that acts as shorthand for quite a complex set of features. There’s likely plenty of material in the training data to help the model achieve that goal but it’s still interesting to see if they manage to spit out something that works first time.
The first version it gave me worked out of the box, but was a little too hard – the enemy bullets move so fast that it’s almost impossible to avoid them:
You can try that out here.
I tried a follow-up prompt of "Make the enemy bullets a little slower". A system like Claude Artifacts or Claude Code implements tool calls for modifying files in place, but the Open WebUI setup I was using didn’t have an equivalent by default, which meant the model had to output the full file a second time.
It did that, and slowed down the bullets, but it made a bunch of other changes as well, shown in this diff. I’m not too surprised by this – asking a 25GB local model to output a lengthy file with just a single change is quite a stretch.
Here’s the exported transcript for those two prompts.
Running LM Studio models with mlx-lm
LM Studio stores its models in the ~/.cache/lm-studio/models directory. This means you can use the mlx-lm Python library to run prompts through the same model like this:
uv run --isolated --with mlx-lm mlx_lm.generate \
--model ~/.cache/lm-studio/models/lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit \
--prompt "Write an HTML and JavaScript page implementing space invaders" \
-m 8192 --top-k 20 --top-p 0.8 --temp 0.7
Be aware that this will load a duplicate copy of the model into memory so you may want to quit LM Studio before running this command!
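mlx-lm also exposes a Python API if you’d rather skip the CLI. Here’s a minimal sketch, assuming a recent mlx-lm release, that loads the same LM Studio copy of the model and runs the same prompt (the same memory caveat applies):
import os
from mlx_lm import load, generate

# Load the 6bit MLX weights straight out of LM Studio's cache directory
model_path = os.path.expanduser(
    "~/.cache/lm-studio/models/lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit"
)
model, tokenizer = load(model_path)

# Wrap the prompt in the model's chat template before generating
messages = [{"role": "user", "content": "Write an HTML and JavaScript page implementing space invaders"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# verbose=True streams tokens and prints generation statistics
text = generate(model, tokenizer, prompt=prompt, max_tokens=8192, verbose=True)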
Accessing the model via my LLM tool
My LLM project provides a command-line tool and Python library for accessing large language models.
Since LM Studio offers an OpenAI-compatible API, you can configure LLM to access models through that API by creating or editing the ~/Library/Application\ Support/io.datasette.llm/extra-openai-models.yaml file:
zed ~/Library/Application\ Support/io.datasette.llm/extra-openai-models.yaml
I added the following YAML configuration:
- model_id: qwen3-coder-30b
model_name: qwen/qwen3-coder-30b
api_base: http://localhost:1234/v1
supports_tools: true
Provided LM Studio is running, I can execute prompts from my terminal like this:
llm -m qwen3-coder-30b 'A joke about a pelican and a cheesecake'
Why did the pelican refuse to eat the cheesecake?
Because it had a beak for dessert! 🥧🦜
(Or if you prefer: Because it was afraid of getting beak-sick from all that creamy goodness!)
(25GB clearly isn’t enough space for a functional sense of humor.)
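Since LLM is a Python library as well as a CLI tool, the same extra-openai-models.yaml entry can be used from Python. A minimal sketch, assuming LM Studio is still serving the model:
import llm

# "qwen3-coder-30b" is the model_id from extra-openai-models.yaml above
model = llm.get_model("qwen3-coder-30b")
response = model.prompt("A joke about a pelican and a cheesecake")
print(response.text())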
More interestingly though, we can start exercising the Qwen model’s support for tool calling:
llm -m qwen3-coder-30b \
-T llm_version -T llm_time --td \
'tell the time then show the version'
Here we are enabling LLM’s two default tools: one for telling the time and one for seeing the version of LLM that’s currently installed. The --td flag stands for --tools-debug.
The output looks like this, debug output included:
Tool call: llm_time({})
{
"utc_time": "2025-07-31 19:20:29 UTC",
"utc_time_iso": "2025-07-31T19:20:29.498635+00:00",
"local_timezone": "PDT",
"local_time": "2025-07-31 12:20:29",
"timezone_offset": "UTC-7:00",
"is_dst": true
}
Tool call: llm_version({})
0.26
The current time is:
– Local Time (PDT): 2025-07-31 12:20:29
– UTC Time: 2025-07-31 19:20:29
The installed version of the LLM is 0.26.
Pretty good! It managed two tool calls from a single prompt.
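The same tool calling can be driven from the Python API. As of LLM 0.26 the chain() method resolves tool calls and feeds their results back to the model; here’s a sketch that passes a hypothetical current_time() function of my own rather than the built-in default tools:
import llm
from datetime import datetime, timezone

def current_time() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()

model = llm.get_model("qwen3-coder-30b")
# chain() keeps prompting until the model stops requesting tool calls
response = model.chain(
    "What time is it right now?",
    tools=[current_time],
)
print(response.text())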
Sadly I couldn’t get it to work with some of my more complex plugins such as llm-tools-sqlite. I’m trying to figure out if that’s a bug in the model, the LM Studio layer or my own code for running tool prompts against OpenAI-compatible endpoints.
The month of Qwen
July has absolutely been the month of Qwen. The models they have released this month are outstanding, packing some extremely useful capabilities even into models I can run in 25GB of RAM or less on my own laptop.
If you’re looking for a competent coding model you can run locally, Qwen3-Coder-30B-A3B is a very solid choice.
Tags: ai, generative-ai, llms, ai-assisted-programming, llm, uv, qwen, pelican-riding-a-bicycle, llm-release, lm-studio, space-invaders
AI Summary and Description: Yes
**Summary:** The text describes the release and capabilities of the Qwen3-Coder-30B-A3B-Instruct model optimized for coding tasks. Its features make it accessible for developers using personal computers with modest hardware, underscoring its significance for AI developers seeking efficient coding aids.
**Detailed Description:**
The text outlines the key attributes of the Qwen3-Coder-30B-A3B-Instruct model, specifically designed for coding applications. Here are the major points highlighted in the content:
– **Model Specifications:**
– **Size and Parameters:**
– Contains 30.5 billion total parameters—3.3 billion active at once.
– Can run on machines with as little as 32GB of RAM when quantized, supporting speed and efficiency.
– **Usability:**
– Engineered as a coding assistant with a focus on performance and response speed, the model can be deployed easily on a standard developer work machine, making it accessible for everyday tasks.
– **Running and Integration:**
– Can be utilized with various frameworks like Apple’s MLX and combined with LM Studio and Open WebUI for a user-friendly interface.
– The installation process is streamlined, offering several quantization options (like 6bit, 4bit, and 8bit variants) that dictate the size and performance of the model.
– **Performance Testing:**
– Practical examples demonstrate using the model to generate coding structures, including HTML and JavaScript, and the responsiveness when prompted to adjust features such as game mechanics in a coding exercise (Space Invaders implementation).
– The model produces immediate, usable code, further emphasizing its focus on coding-related tasks.
– **Tool Calling Capabilities:**
– The model supports tool calling, enabling it to perform specific functions through prompts (e.g. reporting the current time or the installed LLM version), showcasing its flexibility and utility in a programming context.
– **Future Potential:**
– The discussion highlights July as a landmark month for Qwen’s advancements in AI, particularly for coding, indicating strong potential for further innovation and capabilities in local AI model use.
Overall, the Qwen3-Coder-30B-A3B-Instruct model signifies a valuable addition to AI-assisted programming, fulfilling a niche need for developers looking to leverage AI tools directly on personal computing hardware without needing extensive resources. This development emphasizes significant implications for code efficiency and accessibility in software development.