Simon Willison’s Weblog: XBai o4

Source URL: https://simonwillison.net/2025/Aug/3/xbai-o4/#atom-everything
Source: Simon Willison’s Weblog
Title: XBai o4

Feedly Summary: XBai o4
Yet another open source (Apache 2.0) LLM from a Chinese AI lab. The model card claims:

XBai o4 excels in complex reasoning capabilities and has now completely surpassed OpenAI-o3-mini in Medium mode.

This is a 32.8 billion parameter model released by MetaStone AI, a new-to-me lab who released their first model in March – MetaStone-L1-7B – then followed that with MetaStone-S1 in 1.5B, 7B and 32B sizes in July and now XBai o4 in August.
The MetaStone-S1 models were accompanied by a paper, Test-Time Scaling with Reflective Generative Model.
There is very little information available on the English-language web about MetaStone AI. Their paper shows a relationship with USTC, the University of Science and Technology of China in Hefei. One of their researchers confirmed on Twitter that their CEO is from KWAI, which led me to this Chinese-language article from July last year about Li Yan, formerly of KWAI and now the founder of Wen Xiaobai and evidently now the CEO of MetaStone. www.wenxiaobai.com is listed as the “official website” linked to from the XBai-o4 README on GitHub.
Ivan Fioravanti got it working under MLX in 4bit, 5bit, 6bit, 8bit and 4bit-DWQ sizes. I tried his 6bit one (a 24.81GB download) in LM Studio and had it "Generate an SVG of a pelican riding a bicycle". It thought for 39.49 seconds (thinking trace here) and gave me this:

I also had it "Write an HTML and JavaScript page implementing space invaders", which you can try out here. It’s not great – the invaders don’t fire bullets (it listed that as a "potential improvement") and I don’t think it’s winnable due to timing issues, but it’s an interesting addition to my collection.
LM Studio reported a constant memory usage of 25.54GB and according to Activity Monitor used 87.5% of my GPU and around 39% of a CPU core.
It did a lot of work in the thinking trace – which took 10m35s! – including writing the first draft of much of the code that ended up in the finished result. Here’s an extract:

Let me think of variables:
let enemyDirection = 1; // 1 for right, -1 for left
In the game loop:
enemies.forEach(enemy => {
  enemy.x += enemyDirection * enemy.dx;
});
Then check if any enemy is out of bounds. If so, reverse direction and move all enemies down.
But how to check if any enemy is at the edge?
Maybe find the max x of all enemies and see if it exceeds canvas.width - enemy.width.
Alternatively, track the leftmost and rightmost positions of enemies. Maybe a bit complex.
For simplicity, perhaps check if any enemy’s x + width > canvas.width or x < 0.

Here’s the full transcript.
My initial impressions are that I’m not as impressed with this model for running on my own laptop as I was with Qwen3-Coder-30B-A3B-Instruct or GLM-4.5 Air. But... how extraordinary is it that another Chinese AI lab has been able to produce a competitive model, this time with far less fanfare than we’ve seen from Qwen and Moonshot AI and Z.ai.
Via @ivanfioravanti

Tags: ai, generative-ai, llms, mlx, pelican-riding-a-bicycle, llm-reasoning, llm-release, lm-studio, ai-in-china, space-invaders

AI Summary and Description: Yes

**Summary:** The text discusses the release of XBai o4, an open-source large language model (LLM) by a lesser-known Chinese AI lab, MetaStone AI. It highlights the model’s capabilities in complex reasoning and its competitive parameters compared to other known models, as well as sharing insights into the lab’s background and the performance of the model in practical applications.

**Detailed Description:** The text provides an overview of a newly released LLM, XBai o4, emphasizing its relevance in the AI landscape regarding performance and the competitive nature of AI development in China. Here are the significant points:

- **Model Release Information:**
  - XBai o4 is an open-source LLM launched by MetaStone AI, based in China.
  - The model features 32.8 billion parameters and is claimed to excel in complex reasoning, surpassing OpenAI-o3-mini.

- **Background on MetaStone AI:**
  - MetaStone AI had previously launched several models, including MetaStone-L1-7B and the MetaStone-S1 series.
  - The lab’s connection to the University of Science and Technology of China (USTC) is mentioned, adding credibility to their research and development.

- **Model Performance:**
  - Practical testing of XBai o4 under MLX yielded interesting results, including the generation of an SVG graphic and a JavaScript page for a simple “Space Invaders” game.
  - Performance metrics include constant memory usage of 25.54GB, with significant CPU and GPU utilization.

- **Comparative Analysis:**
  - Initial impressions indicate that while XBai o4 is a competitive model, previous models like Qwen3-Coder-30B may perform better in certain respects.
  - The text suggests an ongoing trend of emerging Chinese AI labs producing competitive models at a fast pace, often without significant global attention.

- **Practical Implications:**
  - The ongoing emergence of competitive LLMs from various international players stresses the importance of continuous monitoring and adaptation by professionals in AI security and compliance.
  - The capabilities of these models in reasoning could potentially raise new challenges in LLM security and ethical considerations in AI deployment.

Overall, the release of XBai o4 demonstrates the rapid advancements in AI technology and furthers the discourse surrounding the competitive dynamics in the AI field, particularly concerning security, performance, and ethical implications.
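To make the movement logic from the thinking-trace extract concrete, here is a minimal standalone sketch of the bounce-and-descend pattern it describes. This is not the model’s actual output: enemyDirection, enemy.dx and the edge check follow the extract, while the canvas stand-in, the descent step and the sample enemies array are illustrative assumptions.

```javascript
// Minimal sketch of the bounce-and-descend movement described in the extract.
// Variable names follow the thinking trace; canvas size, speeds and the
// enemies array are illustrative assumptions.
const canvas = { width: 800 };   // stand-in for the real <canvas> element
const enemyDescent = 20;         // pixels to drop when a wall is hit
let enemyDirection = 1;          // 1 for right, -1 for left

const enemies = [
  { x: 40, y: 30, width: 30, dx: 2 },
  { x: 90, y: 30, width: 30, dx: 2 },
];

// Called once per frame from the game loop.
function updateEnemies() {
  // Move every enemy horizontally in the current direction.
  enemies.forEach(enemy => {
    enemy.x += enemyDirection * enemy.dx;
  });

  // If any enemy touches an edge, reverse direction and step the group down.
  const hitEdge = enemies.some(
    enemy => enemy.x + enemy.width > canvas.width || enemy.x < 0
  );
  if (hitEdge) {
    enemyDirection *= -1;
    enemies.forEach(enemy => { enemy.y += enemyDescent; });
  }
}
```

Calling updateEnemies() once per animation frame (for example from requestAnimationFrame) gives the familiar marching behaviour; a full page would also draw the enemies and handle bullets and collisions.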