Source URL: https://simonwillison.net/2025/Apr/28/qwen25-omni/#atom-everything
Source: Simon Willison’s Weblog
Title: Qwen2.5 Omni: See, Hear, Talk, Write, Do It All!
Feedly Summary: Qwen2.5 Omni: See, Hear, Talk, Write, Do It All!
I’m not sure how I missed this one at the time, but last month (March 27th) Qwen released their first multi-modal model that can handle audio and video in addition to text and images.
We propose Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
Here’s the Qwen2.5-Omni Technical Report PDF.
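To make the TMRoPE idea concrete, here's a minimal sketch — my own reconstruction from the paper's one-line description, not Qwen's actual implementation. It's standard rotary position embeddings, except audio and video tokens get their position IDs from a shared timestamp grid (the 25-ticks-per-second granularity and all function names here are assumptions), so tokens from the two streams that occur at the same moment rotate by the same angle:

```python
# Hypothetical sketch of time-aligned rotary embeddings (TMRoPE-style).
# Reconstructed from the paper's description, NOT Qwen's implementation:
# the key idea is that audio and video tokens derive position IDs from
# their timestamps, so co-occurring tokens share a temporal position.
import numpy as np

def rope_angles(position_ids: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE: one rotation angle per (position, frequency) pair."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    return np.outer(position_ids, inv_freq)                  # (seq, dim/2)

def apply_rope(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def timestamps_to_positions(timestamps_s: np.ndarray, ticks_per_s: int = 25) -> np.ndarray:
    """Map timestamps (seconds) onto a shared temporal grid (assumed 25/s)."""
    return np.round(timestamps_s * ticks_per_s).astype(np.int64)

# A video frame at t=1.0s and the audio chunk covering t=1.0s land on the
# same temporal position, so attention treats them as co-occurring.
video_pos = timestamps_to_positions(np.array([0.0, 0.5, 1.0]))
audio_pos = timestamps_to_positions(np.array([0.0, 0.5, 1.0]))
assert (video_pos == audio_pos).all()

q = np.random.randn(3, 64)  # three tokens, head_dim = 64
q_rot = apply_rope(q, rope_angles(video_pos, dim=64))
```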
As far as I can tell nobody has an easy path to getting it working on a Mac yet (the closest report I saw was this comment on Hugging Face).
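On other hardware the intended route is Hugging Face Transformers. Here's a condensed version of what the model card showed at release — the integration was brand new, so treat the class names (`Qwen2_5OmniModel`, `Qwen2_5OmniProcessor`), the `qwen_omni_utils` helper, and the system prompt wording as a snapshot of a moving target rather than a stable API:

```python
# Condensed from the Qwen2.5-Omni Hugging Face model card at release.
# Class names and helpers may have changed since; check the current card.
import soundfile as sf
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    # The model card states that speech output requires (roughly) this
    # specific system prompt.
    {"role": "system", "content": (
        "You are Qwen, a virtual human developed by the Qwen Team, Alibaba "
        "Group, capable of perceiving auditory and visual inputs, as well "
        "as generating text and speech."
    )},
    {"role": "user", "content": [
        {"type": "video", "video": "example.mp4"},
        {"type": "text", "text": "Describe what is happening in this clip."},
    ]},
]

# Build the text prompt and extract the audio/image/video inputs.
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audios=audios, images=images, videos=videos,
    return_tensors="pt", padding=True,
).to(model.device)

# generate() returns both text token IDs and a speech waveform.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```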
This release is notable because, while there’s a pretty solid collection of open weight vision LLMs now, multi-modal models that go beyond that are still very rare. Like most of Qwen’s recent models, Qwen2.5 Omni is released under an Apache 2.0 license.
Qwen 3 is expected to be released within the next 24 hours or so. @jianxliao captured a screenshot of their Hugging Face collection, which was accidentally made public before being withdrawn again, suggesting the new model will be available in 0.6B / 1.7B / 4B / 8B / 30B sizes. I'm particularly excited to try the 30B one – 22-30B has established itself as my favorite size range for running models on my 64GB M2, since models that size often deliver exceptional results while still leaving me enough memory to run other applications at the same time.
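For context on why that size range works on a 64GB machine, here's a back-of-envelope weight-memory estimate (my own rough numbers, ignoring KV cache, activations, and runtime overhead, and assuming 4-bit quantization for local use):

```python
# Rough weight-memory estimate for running an LLM locally.
# Back-of-envelope only: real usage adds KV cache, activations, and
# runtime overhead on top of the weights.
def weight_memory_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for params in (0.6, 1.7, 4, 8, 30):
    print(f"{params:>4}B: fp16 ≈ {weight_memory_gb(params, 16):5.1f} GB, "
          f"4-bit ≈ {weight_memory_gb(params, 4):5.1f} GB")

# A 30B model is ~60 GB at fp16 (too tight for a 64 GB Mac once the OS
# is counted) but only ~15 GB at 4-bit, leaving plenty of headroom.
```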
Tags: vision-llms, llm-release, generative-ai, multi-modal-output, ai, qwen, llms
AI Summary and Description: Yes
Summary: The text discusses Qwen’s release of Qwen2.5 Omni, a multi-modal AI model that accepts audio, video, text, and image inputs and generates both text and natural speech responses in a streaming manner. The release is significant because open-weight multi-modal models that go beyond vision are still rare.
Detailed Description:
– **Qwen2.5 Omni Release**: The latest model from Qwen offers a multi-modal approach, handling audio and video alongside traditional text and images. This model is significant because multi-modal AI systems that extend beyond vision-focused models are still rare in the current landscape.
– **Thinker-Talker Architecture**: This new architecture allows for the simultaneous perception of multiple input types while generating coherent outputs in various formats, including text and natural speech.
– **Novel Position Embedding**: The introduction of TMRoPE (Time-aligned Multimodal RoPE) synchronizes the timestamps of video inputs with audio, which is crucial for generating coherent responses in a streaming manner.
– **Open-Source Licensing**: Qwen2.5 Omni is released under the Apache 2.0 license, enabling wider access and community contributions.
– **Mention of Upcoming Models**: Anticipation is building for Qwen 3, with a briefly leaked Hugging Face collection suggesting sizes from 0.6B to 30B parameters. The author is particularly excited about the 30B configuration, which fits comfortably on a 64GB machine while leaving memory free for other applications.
Key Aspects:
– **Importance of Multi-Modal Models**: The rise of multi-modal systems reflects a growing trend in AI, pushing the boundaries of what AI can accomplish by integrating various forms of media.
– **Technical Challenges**: The text notes deployment hurdles, such as the current lack of an easy way to run the model on a Mac, which is relevant for developers and researchers who want to experiment with it.
– **Community Engagement**: The mention of community platforms like Hugging Face indicates an active ecosystem around AI development, essential for fostering innovation and collaboration among AI practitioners.
This development is particularly relevant for AI professionals focusing on advancing the capabilities of AI systems and integrating them into various applications across industries. The emphasis on licensing and accessibility fosters a collaborative approach to AI research and development.