Simon Willison’s Weblog: Qwen-Image: Crafting with Native Text Rendering

Source URL: https://simonwillison.net/2025/Aug/4/qwen-image/#atom-everything
Source: Simon Willison’s Weblog
Title: Qwen-Image: Crafting with Native Text Rendering

Feedly Summary: Qwen-Image: Crafting with Native Text Rendering
Not content with releasing six excellent open weights LLMs in July, Qwen are kicking off August with their first ever image generation model.
Qwen-Image is a 20 billion parameter MMDiT (Multimodal Diffusion Transformer, originally proposed for Stable Diffusion 3) model under an Apache 2.0 license. The Hugging Face repo is 53.97GB.
Qwen released a detailed technical report (PDF) to accompany the model. The model builds on their Qwen-2.5-VL vision LLM, and they also made extensive use of that model to help create some of their training data:

In our data annotation pipeline, we utilize a capable image captioner (e.g., Qwen2.5-VL) to generate not only comprehensive image descriptions, but also structured metadata that captures essential image properties and quality attributes.
Instead of treating captioning and metadata extraction as independent tasks, we designed an annotation framework in which the captioner concurrently describes visual content and generates detailed information in a structured format, such as JSON. Critical details such as object attributes, spatial relationships, environmental context, and verbatim transcriptions of visible text are captured in the caption, while key image properties like type, style, presence of watermarks, and abnormal elements (e.g., QR codes or facial mosaics) are reported in a structured format.
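
The report doesn't include pipeline code, but the setup it describes maps onto the standard Qwen2.5-VL inference pattern in Hugging Face transformers. Here's a minimal sketch of a single-image annotation pass; the checkpoint choice, prompt wording, and metadata fields are illustrative assumptions, not taken from the paper:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # any Qwen2.5-VL checkpoint would do

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Hypothetical annotation prompt: one pass yields both the free-text caption
# and the structured JSON metadata the report describes.
annotation_prompt = (
    "Describe this image in detail, transcribing any visible text verbatim. "
    "Then output a JSON object with keys: image_type, style, has_watermark, "
    "abnormal_elements (e.g. QR codes or facial mosaics)."
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///data/sample.jpg"},
        {"type": "text", "text": annotation_prompt},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so only the generated annotation is decoded.
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```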

They put a lot of effort into the model’s ability to render text in a useful way. 5% of the training data (described as “billions of image-text pairs”) was data “synthesized through controlled text rendering techniques”, ranging from simple text through text on an image background up to much more complex layout examples:

To improve the model’s capacity to follow complex, structured prompts involving layout-sensitive content, we propose a synthesis strategy based on programmatic editing of pre-defined templates, such as PowerPoint slides or User Interface Mockups. A comprehensive rule-based system is designed to automate the substitution of placeholder text while maintaining the integrity of layout structure, alignment, and formatting.
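
The report doesn't name its tooling, but the technique is easy to illustrate. A hypothetical sketch using the python-pptx library: placeholder runs in a slide template are rewritten in place, which keeps each run's font, size, and alignment intact (the template file and the {{PLACEHOLDER}} convention are invented for this example):

```python
from pptx import Presentation

# Hypothetical placeholder convention: template text such as {{TITLE}} or {{BODY}}.
substitutions = {
    "{{TITLE}}": "Quarterly Trash Report",
    "{{BODY}}": "Raccoon-led initiatives increased foraging yield by 40%.",
}

prs = Presentation("slide_template.pptx")  # illustrative template file

for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                # Rewriting run.text in place preserves the run's font, size,
                # colour, and alignment: the "integrity of layout structure"
                # that the report's rule-based system is designed to maintain.
                for placeholder, value in substitutions.items():
                    if placeholder in run.text:
                        run.text = run.text.replace(placeholder, value)

prs.save("rendered_example.pptx")
# The filled slide would then be rasterised to an image and paired with the
# substituted text to form one synthetic training example.
```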

I tried the model out using the ModelScope demo – I signed in with GitHub and verified my account via a text message to a phone number. Here’s what I got for “A raccoon holding a sign that says ‘I love trash’ that was written by that raccoon”:

The raccoon has very neat handwriting!
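
The weights are also on Hugging Face, so local inference should be possible via diffusers. A minimal, untested sketch, assuming the Qwen/Qwen-Image repo loads through the generic DiffusionPipeline; consult the model card for the recommended sampler settings:

```python
import torch
from diffusers import DiffusionPipeline

# Assumes the Qwen/Qwen-Image weights load via the generic DiffusionPipeline;
# at 20B parameters this needs a large GPU (or CPU offloading).
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = (
    'A raccoon holding a sign that says "I love trash" '
    "that was written by that raccoon"
)

# num_inference_steps is a standard diffusers argument; the model card is
# the authority on recommended step counts and guidance settings.
image = pipe(prompt=prompt, num_inference_steps=50).images[0]
image.save("raccoon.png")
```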
Via @Alibaba_Qwen
Tags: ai, stable-diffusion, generative-ai, vision-llms, training-data, qwen, text-to-image, ai-in-china

AI Summary and Description: Yes

Summary: The text describes the launch of Qwen-Image, a new image generation model that integrates advanced text rendering techniques. This model utilizes multimodal capabilities and emphasizes effective metadata extraction and structured data generation, which could be particularly relevant for professionals in generative AI security and image processing fields.

Detailed Description:
The provided content discusses the unveiling of Qwen-Image, a 20-billion-parameter multimodal diffusion transformer, marking a significant advancement in generative AI technology. Key highlights include:

– **Model Overview**:
  – Qwen-Image is released under an Apache 2.0 license, backed by a substantial (53.97GB) repository on Hugging Face.
  – The model follows the release of six open weights LLMs in July, showcasing ongoing innovation by the Qwen team.

– **Data Annotation Pipeline**:
  – The model employs a sophisticated image captioning system, leveraging the Qwen-2.5-VL vision LLM for both image description and structured metadata generation.
  – The annotation framework integrates both tasks for enhanced efficiency, capturing critical aspects such as (a hypothetical record is sketched after this list):
    – Object attributes.
    – Spatial relationships.
    – Environmental context.
    – Verbatim transcriptions of text visible in images.
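
The report doesn't publish its metadata schema. Purely as an illustration, a single annotation record in the style described above might look like this (all field names hypothetical):

```python
# Hypothetical single record from the annotation pipeline described above.
# Content details live in the caption; image-level properties are structured.
record = {
    "caption": (
        "A raccoon stands on a pavement holding a cardboard sign with the "
        'handwritten text "I love trash", photographed in daylight.'
    ),
    "metadata": {
        "image_type": "photograph",
        "style": "realistic",
        "has_watermark": False,
        "abnormal_elements": [],           # e.g. ["qr_code", "facial_mosaic"]
        "visible_text": ["I love trash"],  # verbatim transcription
    },
}
```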

– **Text Rendering Capabilities**:
  – 5% of Qwen-Image’s training data was synthesized through controlled text rendering, reflecting the model’s focus on reliably rendering text within generated images.
  – A synthesis strategy based on programmatic editing of templates (such as PowerPoint slides) allows for complex layout handling while preserving formatting and structure.

– **User Experience**:
  – The author tested the model via the ModelScope demo and shared a creative outcome, demonstrating the model’s ability to render legible, accurate text within a generated image.

– **Implications for Security and Compliance**:
  – The detailed metadata extraction and structured data generation could assist in enhancing content governance and compliance measures, outlining how AI-generated content can be effectively managed and audited.
  – Given the model’s capabilities in rendering text within images, security professionals should also be cautious of potential misuse, including the generation of deceptive images or misinformation.

Overall, Qwen-Image signifies an important step in the evolution of generative AI technologies, combining advanced image generation with rigorous data handling techniques valuable for security, compliance, and infrastructure professionals.