Source URL: https://simonwillison.net/2025/Sep/27/video-models-are-zero-shot-learners-and-reasoners/
Source: Simon Willison’s Weblog
Title: Video models are zero-shot learners and reasoners
Fascinating new paper from Google DeepMind which makes a very convincing case that their Veo 3 model – and generative video models in general – serve a similar role in the machine learning visual ecosystem as LLMs do for text.
LLMs took the ability to predict the next token and turned it into general-purpose foundation models for all manner of tasks that used to be handled by dedicated models – summarization, translation, part-of-speech tagging and more can now all be handled by single huge models, which are getting both more powerful and cheaper as time progresses.
Generative video models like Veo 3 may well serve the same role for vision and image reasoning tasks.
From the paper:
We believe that video models will become unifying, general-purpose foundation models for machine vision just like large language models (LLMs) have become foundation models for natural language processing (NLP). […]
Machine vision today in many ways resembles the state of NLP a few years ago: There are excellent task-specific models like “Segment Anything” for segmentation or YOLO variants for object detection. While attempts to unify some vision tasks exist, no existing model can solve any problem just by prompting. However, the exact same primitives that enabled zero-shot learning in NLP also apply to today’s generative video models—large-scale training with a generative objective (text/video continuation) on web-scale data. […]
Analyzing 18,384 generated videos across 62 qualitative and 7 quantitative tasks, we report that Veo 3 can solve a wide range of tasks that it was neither trained nor adapted for.
Based on its ability to perceive, model, and manipulate the visual world, Veo 3 shows early forms of “chain-of-frames (CoF)” visual reasoning like maze and symmetry solving.
While task-specific bespoke models still outperform a zero-shot video model, we observe a substantial and consistent performance improvement from Veo 2 to Veo 3, indicating a rapid advancement in the capabilities of video models.
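That zero-shot framing is easy to picture in code. Here's a minimal sketch (my reconstruction, not the authors' code or the real Veo API) of the evaluation pattern the paper describes: condition a video model on a task image plus a text instruction, then read the answer off the final generated frame. `generate_video` is a hypothetical stand-in for whatever video-generation SDK you use.

```python
from PIL import Image

def generate_video(prompt: str, first_frame: Image.Image) -> list[Image.Image]:
    """Hypothetical stand-in: returns the generated clip as a list of frames.
    Swap in a real video-generation API (e.g. a Veo client) here."""
    raise NotImplementedError

def solve_maze_zero_shot(maze_png: str) -> Image.Image:
    """Zero-shot maze solving: no maze-specific training, just a prompt."""
    maze = Image.open(maze_png)
    frames = generate_video(
        prompt=(
            "A red dot moves through the maze from the green start square "
            "to the blue goal square without crossing any walls."
        ),
        first_frame=maze,
    )
    # The "answer" is whatever state the model has reached by the last frame.
    return frames[-1]
```

The point is that the same prompting interface covers segmentation, editing, maze solving and so on: the task lives entirely in the prompt and the conditioning image, not in the model's training.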
I particularly enjoyed the way they coined the new term chain-of-frames to reflect chain-of-thought in LLMs. A chain-of-frames is how a video generation model can “reason” about the visual world:
Perception, modeling, and manipulation all integrate to tackle visual reasoning. While language models manipulate human-invented symbols, video models can apply changes across the dimensions of the real world: time and space. Since these changes are applied frame-by-frame in a generated video, this parallels chain-of-thought in LLMs and could therefore be called chain-of-frames, or CoF for short. In the language domain, chain-of-thought enabled models to tackle reasoning problems. Similarly, chain-of-frames (a.k.a. video generation) might enable video models to solve challenging visual problems that require step-by-step reasoning across time and space.
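To make the analogy concrete, here's a toy sketch (mine, not the paper's) of the parallel: chain-of-thought decomposes a problem into next-token predictions, chain-of-frames into next-frame predictions. The `llm` and `video_model` objects and their methods are hypothetical, and real video models may denoise a clip jointly rather than generating strictly frame by frame – this is the conceptual shape, not an implementation.

```python
def chain_of_thought(llm, prompt: str, steps: int = 64) -> str:
    """An LLM reasons by appending one token at a time to its context."""
    tokens = llm.tokenize(prompt)
    for _ in range(steps):
        tokens.append(llm.predict_next_token(tokens))  # one small reasoning step
    return llm.detokenize(tokens)

def chain_of_frames(video_model, first_frame, prompt: str, steps: int = 64) -> list:
    """A video model 'reasons' by extending the clip one frame at a time:
    each new frame applies a small change across time and space."""
    frames = [first_frame]
    for _ in range(steps):
        frames.append(video_model.predict_next_frame(frames, prompt))
    return frames  # the final frame carries the solved state
```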
The PDF is 45 pages long but the main paper is just the first 9.5 pages – the rest is mostly appendices. Reading those first 10 pages will give you the full details of their argument.
Tags: google, video, ai, generative-ai, llms, gemini, paper-review, video-models
AI Summary and Description: Yes
Summary: The text discusses a groundbreaking paper from Google DeepMind that introduces the Veo 3 model, a generative video model that functions similarly to large language models (LLMs) in natural language processing. The author emphasizes the potential for Veo 3 to unify various visual tasks, reflecting advances in machine vision akin to those seen in NLP. Key insights include the introduction of “chain-of-frames” as a parallel to the “chain-of-thought” concept in LLMs, highlighting how video models can reason across time and space in generated videos.
Detailed Description:
The paper from Google DeepMind presents a compelling case for the Veo 3 model’s capabilities in performing tasks typically reserved for specialized machine learning models, positioning it as a general-purpose foundation model in the realm of machine vision similar to LLMs in text processing.
– **Core Innovations**:
– **Zero-shot Learning**: Veo 3 showcases the ability to perform a range of tasks without direct training or adaptation, leveraging the paradigm established by LLMs.
– **Generative Objectives**: The model is built on large-scale training with a generative objective (text/video continuation) on web-scale data – the same primitives that enabled zero-shot learning in NLP.
– **Performance Metrics**: An analysis of 18,384 generated videos across 62 qualitative and 7 quantitative tasks shows that while task-specific models still outperform Veo 3, there is a substantial and consistent improvement from Veo 2 to Veo 3 (a sketch of this style of evaluation follows this list).
– **Terminology**:
– The novel term “chain-of-frames” describes the model’s capacity to reason about visual content by integrating perception, modeling, and manipulation frame by frame, analogous to chain-of-thought reasoning in LLMs.
– **Insights and Future Directions**:
– The ability of Veo 3 to solve complex visual reasoning problems could revolutionize how machine vision is approached, enabling a wide array of applications from simple video tasks to more intricate reasoning challenges.
– The analogy to LLMs implies a paradigm shift for video processing, suggesting that generative video models may soon become as foundational for vision tasks as LLMs are for language tasks.
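As a concrete illustration of the quantitative comparison mentioned above, a harness like the hypothetical sketch below would sample several videos per task and report a pass@k-style success rate per model; the paper's exact protocol and task checkers are in the paper itself.

```python
def success_rate(model, tasks, checker, k: int = 10) -> float:
    """Fraction of tasks solved by at least one of k sampled videos (pass@k).
    `checker(task, final_frame) -> bool` decides whether a task was solved;
    `model.generate(prompt, image)` is a hypothetical sampling call that
    returns a list of frames."""
    solved = 0
    for task in tasks:
        attempts = (model.generate(task.prompt, task.image) for _ in range(k))
        if any(checker(task, frames[-1]) for frames in attempts):
            solved += 1
    return solved / len(tasks)

# Running the same harness over Veo 2 and Veo 3 is how you would surface the
# generation-over-generation improvement the paper reports.
```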
With its promising capabilities and the chain-of-frames framing it introduces, the paper underscores advances in AI that could have profound implications for intelligent systems across domains, including security and compliance scenarios where automated video analysis may become crucial.