Simon Willison’s Weblog: GLM-4.5: Reasoning, Coding, and Agentic Abilities

Source URL: https://simonwillison.net/2025/Jul/28/glm-45/#atom-everything
Source: Simon Willison’s Weblog
Title: GLM-4.5: Reasoning, Coding, and Agentic Abilities

Feedly Summary: GLM-4.5: Reasoning, Coding, and Agentic Abilities
Another day, another significant new open weight model release from a Chinese frontier AI lab.
This time it’s Z.ai – who rebranded (at least in English) from Zhipu AI a few months ago. They just dropped GLM-4.5-Base, GLM-4.5 and GLM-4.5 Air on Hugging Face, all under an MIT license.
These are MoE hybrid reasoning models with thinking and non-thinking modes, similar to Qwen 3. GLM-4.5 is 355 billion total parameters with 32 billion active; GLM-4.5-Air is 106 billion total parameters with 12 billion active.
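The "active vs. total parameters" distinction comes from how mixture-of-experts layers work: a router picks a small subset of experts per token, so only a fraction of the weights are exercised on any forward pass. Here is a minimal toy sketch of top-k MoE routing (the dimensions and expert count are illustrative, not GLM-4.5's actual configuration):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy top-k mixture-of-experts layer: only k experts run per token.

    x: (d,) token hidden state; gate_w: (d, n_experts) router weights;
    experts: list of (d, d) weight matrices, one per expert.
    """
    logits = x @ gate_w
    topk = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                # softmax over just the selected experts
    # Only the chosen experts' parameters are touched -- this is why a model
    # with 355B total parameters can have only 32B "active" per token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, topk))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With k=2 of 16 experts selected, only 1/8 of the expert parameters participate per token, which is roughly the same active/total ratio the GLM-4.5 figures above imply.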
They started using MIT a few months ago for their GLM-4-0414 models – their older releases used a janky non-open-source custom license.
Z.ai’s own benchmarking (across 12 common benchmarks) ranked their GLM-4.5 3rd behind o3 and Grok-4 and just ahead of Claude Opus 4. They ranked GLM-4.5 Air 6th place just ahead of Claude 4 Sonnet. I haven’t seen any independent benchmarks yet.
The other models they included in their own benchmarks were o4-mini (high), Gemini 2.5 Pro, Qwen3-235B-Thinking-2507, DeepSeek-R1-0528, Kimi K2, GPT-4.1, DeepSeek-V3-0324. Notably absent: any of Meta’s Llama models, or any of Mistral’s. Did they deliberately only compare themselves to open weight models from other Chinese AI labs?
Both models have a 128,000 context length and are trained for tool calling, which honestly feels like table stakes for any model released in 2025 at this point.
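"Trained for tool calling" generally means the model emits structured function-call requests when the client supplies tool schemas. As an illustration only, here is the OpenAI-compatible request shape most 2025 chat APIs accept — the model identifier and tool are placeholders, not documented Z.ai values:

```python
import json

# Hypothetical tool definition in the widely used "function calling" schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                       # illustrative tool name
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "glm-4.5",                              # placeholder identifier
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
}
print(json.dumps(request_body, indent=2)[:80])
```

A tool-trained model responds to such a request with a structured call (tool name plus JSON arguments) rather than prose, which the client executes and feeds back.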
It’s interesting to see them use Claude Code to run their own coding benchmarks:

To assess GLM-4.5’s agentic coding capabilities, we utilized Claude Code to evaluate performance against Claude-4-Sonnet, Kimi K2, and Qwen3-Coder across 52 coding tasks spanning frontend development, tool development, data analysis, testing, and algorithm implementation. […] The empirical results demonstrate that GLM-4.5 achieves a 53.9% win rate against Kimi K2 and exhibits dominant performance over Qwen3-Coder with an 80.8% success rate. While GLM-4.5 shows competitive performance, further optimization opportunities remain when compared to Claude-4-Sonnet.

They published the dataset for that benchmark as zai-org/CC-Bench-trajectories on Hugging Face. I think they’re using the word “trajectory” for what I would call a chat transcript.

Unlike DeepSeek-V3 and Kimi K2, we reduce the width (hidden dimension and number of routed experts) of the model while increasing the height (number of layers), as we found that deeper models exhibit better reasoning capacity.

They pre-trained on 15 trillion tokens, then an additional 7 trillion for code and reasoning:

Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model’s performance on key downstream domains.

They also open sourced their post-training reinforcement learning harness, which they’ve called slime. That’s available at THUDM/slime on GitHub – THUDM is the Knowledge Engineering Group @ Tsinghua University, the university from which Zhipu AI spun out as an independent company.
This time I ran my pelican benchmark using the chat.z.ai chat interface, which offers free access (no account required) to both GLM 4.5 and GLM 4.5 Air. I had reasoning enabled for both.
Here’s what I got for "Generate an SVG of a pelican riding a bicycle" on GLM 4.5. I like how the pelican has its wings on the handlebars:

And GLM 4.5 Air:

Ivan Fioravanti shared a video of the mlx-community/GLM-4.5-Air-4bit quantized model running on a M4 Mac with 128GB of RAM, and it looks like a very strong contender for a local model that can write useful code. The cheapest 128GB Mac Studio costs around $3,500 right now, so genuinely great open weight coding models are creeping closer to being affordable on consumer machines.
Tags: ai, generative-ai, local-llms, llms, mlx, pelican-riding-a-bicycle, llm-reasoning, llm-release

AI Summary and Description: Yes

Summary: The text discusses the release of the GLM-4.5 model from Z.ai, highlighting its significant parameters, capabilities in reasoning and coding, and competitive benchmarks against other models. This is particularly relevant to AI security professionals, as it showcases advancements in AI models that could impact various applications including security tools and frameworks.

Detailed Description: The GLM-4.5 release by Z.ai, formerly known as Zhipu AI, marks a significant step in the development of large language models (LLMs). Here are the key points regarding GLM-4.5 and its implications:

– **Model Specifications**:
– Three models released: GLM-4.5-Base, GLM-4.5, and GLM-4.5 Air, all under an MIT license.
– GLM-4.5 features 355 billion parameters, with 32 billion active, while GLM-4.5-Air has 106 billion total parameters and 12 billion active.

– **Benchmarking and Performance**:
– The models were benchmarked across 12 common benchmarks and ranked competitively against leading models like Claude Opus 4 and o3.
– GLM-4.5 achieved a 53.9% win rate against Kimi K2 and an impressive 80.8% success rate over Qwen3-Coder in coding tasks.

– **Training and Dataset**:
– Pre-trained on a massive corpus of 15 trillion tokens, with an additional 7 trillion tokens focused on coding and reasoning tasks.
– The post-training reinforcement learning harness, named “slime,” has been made open source on GitHub.

– **Features**:
– Supports tool calling, emphasizing its utility in various applications.
– Capable of processing lengthy contexts, with a context length of 128,000 tokens, enhancing its interactive capabilities.

– **Implications for Security**:
– The open-source nature of the model and its accessible datasets may encourage the development of new AI-driven security solutions.
– As these tools become increasingly powerful, they could be leveraged in both defensive and offensive security contexts, underscoring the need for robust security measures to protect AI assets.

– **Market Trends**:
– The availability of such advanced models on consumer hardware signals that powerful AI tools are becoming more accessible, with implications for compliance, security, and ethics in deploying these technologies.

Overall, the advancements in GLM-4.5 serve not only to illustrate the rapid development of AI capabilities but also highlight the corresponding need for vigilance in security and compliance within these evolving technological landscapes.