Simon Willison’s Weblog: Qwen-Image: Crafting with Native Text Rendering

Source URL: https://simonwillison.net/2025/Aug/4/qwen-image/#atom-everything
Source: Simon Willison’s Weblog
Title: Qwen-Image: Crafting with Native Text Rendering

Feedly Summary: Qwen-Image: Crafting with Native Text Rendering
Not content with releasing six excellent open weights LLMs in July, Qwen are kicking off August with their first ever image generation model.
Qwen-Image is a 20 billion parameter MMDiT (Multimodal Diffusion Transformer, originally proposed for Stable Diffusion 3) model under an Apache 2.0 license. The Hugging Face repo is 53.97GB.
Qwen released a detailed technical report (PDF) to accompany the model. The model builds on their Qwen-2.5-VL vision LLM, and they also made extensive use of that model to help create some of their training data:

In our data annotation pipeline, we utilize a capable image captioner (e.g., Qwen2.5-VL) to generate not only comprehensive image descriptions, but also structured metadata that captures essential image properties and quality attributes.
Instead of treating captioning and metadata extraction as independent tasks, we designed an annotation framework in which the captioner concurrently describes visual content and generates detailed information in a structured format, such as JSON. Critical details such as object attributes, spatial relationships, environmental context, and verbatim transcriptions of visible text are captured in the caption, while key image properties like type, style, presence of watermarks, and abnormal elements (e.g., QR codes or facial mosaics) are reported in a structured format.
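
The report doesn't include pipeline code, but the setup it describes maps onto the standard Qwen2.5-VL inference pattern in Hugging Face transformers. Here's a minimal sketch of a single-image annotation pass; the checkpoint choice, prompt wording, and metadata fields are illustrative assumptions, not taken from the paper:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # any Qwen2.5-VL checkpoint would do

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Hypothetical annotation prompt: one pass yields both the free-text caption
# and the structured JSON metadata the report describes.
annotation_prompt = (
    "Describe this image in detail, transcribing any visible text verbatim. "
    "Then output a JSON object with keys: image_type, style, has_watermark, "
    "abnormal_elements (e.g. QR codes or facial mosaics)."
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///data/sample.jpg"},
        {"type": "text", "text": annotation_prompt},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so only the generated annotation is decoded.
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```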

They put a lot of effort into the model’s ability to render text in a useful way. 5% of the training data (described as “billions of image-text pairs”) was data “synthesized through controlled text rendering techniques”, ranging from simple text through text on an image background up to much more complex layout examples:

To improve the model’s capacity to follow complex, structured prompts involving layout-sensitive content, we propose a synthesis strategy based on programmatic editing of pre-defined templates, such as PowerPoint slides or User Interface Mockups. A comprehensive rule-based system is designed to automate the substitution of placeholder text while maintaining the integrity of layout structure, alignment, and formatting.
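
The report doesn't name its tooling, but the technique is easy to illustrate. A hypothetical sketch using the python-pptx library: placeholder runs in a slide template are rewritten in place, which keeps each run's font, size, and alignment intact (the template file and the {{PLACEHOLDER}} convention are invented for this example):

```python
from pptx import Presentation

# Hypothetical placeholder convention: template text such as {{TITLE}} or {{BODY}}.
substitutions = {
    "{{TITLE}}": "Quarterly Trash Report",
    "{{BODY}}": "Raccoon-led initiatives increased foraging yield by 40%.",
}

prs = Presentation("slide_template.pptx")  # illustrative template file

for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                # Rewriting run.text in place preserves the run's font, size,
                # colour, and alignment: the "integrity of layout structure"
                # that the report's rule-based system is designed to maintain.
                for placeholder, value in substitutions.items():
                    if placeholder in run.text:
                        run.text = run.text.replace(placeholder, value)

prs.save("rendered_example.pptx")
# The filled slide would then be rasterised to an image and paired with the
# substituted text to form one synthetic training example.
```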

I tried the model out using the ModelScope demo – I signed in with GitHub and verified my account via a text message to a phone number. Here’s what I got for “A raccoon holding a sign that says ‘I love trash’ that was written by that raccoon”:

The raccoon has very neat handwriting!
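
The weights are also on Hugging Face, so local inference should be possible via diffusers. A minimal, untested sketch, assuming the Qwen/Qwen-Image repo loads through the generic DiffusionPipeline; consult the model card for the recommended sampler settings:

```python
import torch
from diffusers import DiffusionPipeline

# Assumes the Qwen/Qwen-Image weights load via the generic DiffusionPipeline;
# at 20B parameters this needs a large GPU (or CPU offloading).
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = (
    'A raccoon holding a sign that says "I love trash" '
    "that was written by that raccoon"
)

# num_inference_steps is a standard diffusers argument; the model card is
# the authority on recommended step counts and guidance settings.
image = pipe(prompt=prompt, num_inference_steps=50).images[0]
image.save("raccoon.png")
```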
Via @Alibaba_Qwen
Tags: ai, stable-diffusion, generative-ai, vision-llms, training-data, qwen, text-to-image, ai-in-china

AI Summary and Description: Yes

Summary: The text describes the launch of Qwen-Image, a new image generation model that integrates advanced text rendering techniques. This model utilizes multimodal capabilities and emphasizes effective metadata extraction and structured data generation, which could be particularly relevant for professionals in generative AI security and image processing fields.

Detailed Description:
The provided content discusses the unveiling of Qwen-Image, a 20-billion-parameter multimodal diffusion transformer, marking a significant advancement in generative AI technology. Key highlights include:

– **Model Overview**:
  – Qwen-Image is released under an Apache 2.0 license, backed by a substantial (53.97GB) repository on Hugging Face.
  – The model follows the release of six open weights LLMs in July, showcasing ongoing innovation by the Qwen team.

– **Data Annotation Pipeline**:
  – The model employs a sophisticated image captioning system, leveraging the Qwen-2.5-VL vision LLM for both image description and structured metadata generation.
  – The annotation framework integrates both tasks for enhanced efficiency, capturing critical aspects such as (a hypothetical record is sketched after this list):
    – Object attributes.
    – Spatial relationships.
    – Environmental context.
    – Verbatim transcriptions of text visible in images.
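
The report doesn't publish its metadata schema. Purely as an illustration, a single annotation record in the style described above might look like this (all field names hypothetical):

```python
# Hypothetical single record from the annotation pipeline described above.
# Content details live in the caption; image-level properties are structured.
record = {
    "caption": (
        "A raccoon stands on a pavement holding a cardboard sign with the "
        'handwritten text "I love trash", photographed in daylight.'
    ),
    "metadata": {
        "image_type": "photograph",
        "style": "realistic",
        "has_watermark": False,
        "abnormal_elements": [],           # e.g. ["qr_code", "facial_mosaic"]
        "visible_text": ["I love trash"],  # verbatim transcription
    },
}
```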

– **Text Rendering Capabilities**:
  – 5% of Qwen-Image’s training data was synthesized through controlled text rendering, reflecting the model’s focus on reliably rendering text within generated images.
  – A synthesis strategy based on programmatic editing of templates (such as PowerPoint slides) allows for complex layout handling while preserving formatting and structure.

– **User Experience**:
  – The author tested the model via the ModelScope demo and shared a creative outcome, demonstrating the model’s ability to render legible, accurate text within a generated image.

– **Implications for Security and Compliance**:
  – The detailed metadata extraction and structured data generation could assist in enhancing content governance and compliance measures, outlining how AI-generated content can be effectively managed and audited.
  – Given the model’s capabilities in rendering text within images, security professionals should also be cautious of potential misuse, including the generation of deceptive images or misinformation.

Overall, Qwen-Image signifies an important step in the evolution of generative AI technologies, combining advanced image generation with rigorous data handling techniques valuable for security, compliance, and infrastructure professionals.