Simon Willison’s Weblog: Trying out QvQ – Qwen’s new visual reasoning model

Source URL: https://simonwillison.net/2024/Dec/24/qvq/#atom-everything
Source: Simon Willison’s Weblog
Title: Trying out QvQ – Qwen’s new visual reasoning model

Feedly Summary: I thought we were done for major model releases in 2024, but apparently not: Alibaba’s Qwen team just dropped the Apache2 2 licensed QvQ-72B-Preview, “an experimental research model focusing on enhancing visual reasoning capabilities".
Their blog post is titled QvQ: To See the World with Wisdom – similar flowery language to their QwQ announcement QwQ: Reflect Deeply on the Boundaries of the Unknown a few weeks ago in November.
It’s a vision-focused follow-up to QwQ, which I wrote about previousy. QwQ is an impressive openly licensed inference-scaling model: give it a prompt and it will think out loud over many tokens while trying to derive a good answer, similar to OpenAI’s o1 and o3 models.
The new QvQ adds vision to the mix. You can try it out on Hugging Face Spaces – it accepts an image and a single prompt and then streams out a very long response where it thinks through the problem you have posed it. There’s no option to send a follow-up prompt.
I’ve tried it out with a bunch of things, with mixed results – but it’s really fun seeing how it works through a problem.
My most successful prompt was "Count the pelicans" with this image:

Here’s the full response. Some highlights:

So I’ve got this picture of some pelicans, and I need to count them. Let me see… First off, I know that pelicans are those big birds with the massive beaks, right? […]
Let me try counting again, more carefully this time. Starting from the left:

The first pelican is fully visible, standing at the front.
Behind it, slightly to the right, is the second pelican.
Then, further back, there’s a third pelican.
And next to that one, there’s a fourth pelican.

So, that’s four pelicans in total. But, I notice that on the far right of the image, there’s part of another bird, maybe a pelican’s back or tail feathers, but it’s not fully in the frame, so I don’t think I should count it as a whole pelican.

Something about the tone of voice it uses, "First off, I know that pelicans are those big birds with the massive beaks, right?", tickles me.
I also tried this on a much harder image – it counted 17 which looks about right to me but honestly I have trouble with that image myself.
I fed in one of the ARC-AGI puzzles that o3 had failed at.

It produced a very lengthy chain of thought that was almost entirely incorrect, but had some interesting "ideas" in it:

[…] Let me try to think in terms of cellular automata rules, like Conway’s Game of Life.
In Game of Life, each cell’s state in the next generation is determined by its current state and the states of its eight neighbors.
Perhaps a similar rule applies here.

Finally, I asked it to "Estimate the height of the dinosaur" against this image (which, as it correctly noted, is actually an inflatable dragon):

I’m not sure this task is even possible given the lack of items in the frame of comparable known height. It had a good go at it though! I particularly enjoyed how it began:

So I’ve got this inflatable Christmas decoration, and it’s a dragon! But not just any dragon—it’s a green dragon wearing a Santa hat and a red and white scarf. And get this, there’s a reindeer on its back, also wearing a Santa hat and holding onto the dragon’s wings. It’s pretty festive and colorful, with lots of vibrant greens, reds, and yellows.
The dragon has these big, friendly eyes and a smiling face, which makes it look really cheerful. Its wings are spread out, and it has white claws on its feet. The reindeer is brown with antlers and seems to be enjoying the ride on the dragon’s back.
Now, the question is, how tall is this inflatable dragon? I need to estimate its height. Let me see. I’m standing next to it, and I’m about 5 feet 10 inches tall. The dragon is definitely taller than me. Maybe around 8 or 9 feet high? But I should try to be more precise.

I wonder how it decided that its own height was 5 feet 10 inches!
Running QwQ locally
All of my experiments so far have used the hosted Hugging Face demo. I’m hoping to try it out on my own laptop soon – I think it should just fit in 64GB of M2, maybe with a smaller quantization.
Right now the options are:

Qwen/QVQ-72B-Preview on Hugging Face has the GPU model weights, for use with Hugging Face Transformers and the qwen-vl-utils Python package.
Prince Canuma is already converting the model for Apple’s MLX framework – it should hopefully be available soon via his excellent mlx-vlm package.

As a happy user of Ollama’s qwq port I’m hoping they add a QwQ release at some point soon as well.
Tags: ai, generative-ai, llms, hugging-face, vision-llms, qwen, inference-scaling

AI Summary and Description: Yes

**Summary:** The text discusses Alibaba’s recent release of the QvQ-72B-Preview model, which enhances visual reasoning capabilities, building upon the previously announced QwQ model. This development is significant as it offers insights into advancements in generative AI and visual reasoning, with practical examples of its functions and capabilities.

**Detailed Description:**
The text highlights the following major points regarding the QvQ-72B-Preview model:

– **Model Launch:** Alibaba’s Qwen team has released the QvQ-72B-Preview under an Apache2 license, signifying an ongoing innovation in AI capabilities for visual reasoning.
– **Relation to Previous Models:** The new model is described as a follow-up to the QwQ model, which has already shown promise in inference-scaling.
– **Functionality Overview:**
– The QvQ-72B-Preview processes images alongside prompts, streaming extensive reasoning before providing an answer, similar to models from OpenAI.
– Unlike some other models, it currently does not allow for follow-up prompts, which may limit conversational dynamics.
– **Experimental Observations:**
– Personal testing revealed mixed outcomes when entering prompts, illustrating both strengths and weaknesses of the model in problem-solving scenarios.
– An example of counting pelicans demonstrated the model’s reasoning process, revealing its approach to detail and thoughtful counting.
– **Complex Reasoning Attempts:**
– In more challenging tasks, the model engaged in detailed reasoning surrounding concepts such as cellular automata, albeit with inaccuracies, showcasing the exploratory nature of its outputs.
– When tasked with estimating the height of an inflatable decoration, the model articulated an amusing narrative while attempting to provide a height estimate, indicating its ability to generate contextual descriptions alongside reasoning.
– **Future Use Cases:**
– The user plans to run QvQ locally, which suggests potential for broader applications in various environments, particularly for users with substantial computing power (e.g., 64GB RAM).
– **Integration Possibilities:**
– Options for using QvQ-72B-Preview are available through Hugging Face, with additional developments in progress for compatibility with different machine learning frameworks.

With the emergence of models like QvQ-72B-Preview, significant advancements in visual reasoning within the generative AI domain are evident. These developments hold practical implications for AI, cloud, and infrastructure security practitioners, as they may lead to enhanced capabilities in applications such as automated image analysis, content generation, and complex decision-making scenarios.