Source URL: https://simonwillison.net/2025/Sep/27/video-models-are-zero-shot-learners-and-reasoners/
Source: Simon Willison’s Weblog
Title: Video models are zero-shot learners and reasoners
Fascinating new paper from Google DeepMind which makes a very convincing case that their Veo 3 model – and generative video models in general – serve a similar role in the machine learning visual ecosystem as LLMs do for text.
LLMs took the ability to predict the next token and turned it into general-purpose foundation models for all manner of tasks that used to be handled by dedicated models – summarization, translation, part-of-speech tagging and more can now all be handled by single huge models, which are getting both more powerful and cheaper as time progresses.
Generative video models like Veo 3 may well serve the same role for vision and image reasoning tasks.
From the paper:
We believe that video models will become unifying, general-purpose foundation models for machine vision just like large language models (LLMs) have become foundation models for natural language processing (NLP). […]
Machine vision today in many ways resembles the state of NLP a few years ago: There are excellent task-specific models like “Segment Anything” for segmentation or YOLO variants for object detection. While attempts to unify some vision tasks exist, no existing model can solve any problem just by prompting. However, the exact same primitives that enabled zero-shot learning in NLP also apply to today’s generative video models—large-scale training with a generative objective (text/video continuation) on web-scale data. […]
Analyzing 18,384 generated videos across 62 qualitative and 7 quantitative tasks, we report that Veo 3 can solve a wide range of tasks that it was neither trained nor adapted for.
Based on its ability to perceive, model, and manipulate the visual world, Veo 3 shows early forms of “chain-of-frames (CoF)” visual reasoning like maze and symmetry solving.
While task-specific bespoke models still outperform a zero-shot video model, we observe a substantial and consistent performance improvement from Veo 2 to Veo 3, indicating a rapid advancement in the capabilities of video models.
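That zero-shot framing is easy to picture in code. Here's a minimal sketch (my reconstruction, not the authors' code or the real Veo API) of the evaluation pattern the paper describes: condition a video model on a task image plus a text instruction, then read the answer off the final generated frame. `generate_video` is a hypothetical stand-in for whatever video-generation SDK you use.

```python
from PIL import Image

def generate_video(prompt: str, first_frame: Image.Image) -> list[Image.Image]:
    """Hypothetical stand-in: returns the generated clip as a list of frames.
    Swap in a real video-generation API (e.g. a Veo client) here."""
    raise NotImplementedError

def solve_maze_zero_shot(maze_png: str) -> Image.Image:
    """Zero-shot maze solving: no maze-specific training, just a prompt."""
    maze = Image.open(maze_png)
    frames = generate_video(
        prompt=(
            "A red dot moves through the maze from the green start square "
            "to the blue goal square without crossing any walls."
        ),
        first_frame=maze,
    )
    # The "answer" is whatever state the model has reached by the last frame.
    return frames[-1]
```

The point is that the same prompting interface covers segmentation, editing, maze solving and so on: the task lives entirely in the prompt and the conditioning image, not in the model's training.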
I particularly enjoyed the way they coined the new term chain-of-frames to reflect chain-of-thought in LLMs. A chain-of-frames is how a video generation model can “reason” about the visual world:
Perception, modeling, and manipulation all integrate to tackle visual reasoning. While language models manipulate human-invented symbols, video models can apply changes across the dimensions of the real world: time and space. Since these changes are applied frame-by-frame in a generated video, this parallels chain-of-thought in LLMs and could therefore be called chain-of-frames, or CoF for short. In the language domain, chain-of-thought enabled models to tackle reasoning problems. Similarly, chain-of-frames (a.k.a. video generation) might enable video models to solve challenging visual problems that require step-by-step reasoning across time and space.
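To make the analogy concrete, here's a toy sketch (mine, not the paper's) of the parallel: chain-of-thought decomposes a problem into next-token predictions, chain-of-frames into next-frame predictions. The `llm` and `video_model` objects and their methods are hypothetical, and real video models may denoise a clip jointly rather than generating strictly frame by frame – this is the conceptual shape, not an implementation.

```python
def chain_of_thought(llm, prompt: str, steps: int = 64) -> str:
    """An LLM reasons by appending one token at a time to its context."""
    tokens = llm.tokenize(prompt)
    for _ in range(steps):
        tokens.append(llm.predict_next_token(tokens))  # one small reasoning step
    return llm.detokenize(tokens)

def chain_of_frames(video_model, first_frame, prompt: str, steps: int = 64) -> list:
    """A video model 'reasons' by extending the clip one frame at a time:
    each new frame applies a small change across time and space."""
    frames = [first_frame]
    for _ in range(steps):
        frames.append(video_model.predict_next_frame(frames, prompt))
    return frames  # the final frame carries the solved state
```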
The PDF is 45 pages long but the main paper is just the first 9.5 pages – the rest is mostly appendices. Reading those first 10 pages will give you the full details of their argument.
Tags: google, video, ai, generative-ai, llms, gemini, paper-review, video-models
AI Summary and Description: Yes
Summary: The text discusses a groundbreaking paper from Google DeepMind that introduces the Veo 3 model, a generative video model that functions similarly to large language models (LLMs) in natural language processing. The author emphasizes the potential for Veo 3 to unify various visual tasks, reflecting advances in machine vision akin to those seen in NLP. Key insights include the introduction of “chain-of-frames” as a parallel to the “chain-of-thought” concept in LLMs, highlighting how video models can reason across time and space in generated videos.
Detailed Description:
The paper from Google DeepMind presents a compelling case for the Veo 3 model’s capabilities in performing tasks typically reserved for specialized machine learning models, positioning it as a general-purpose foundation model in the realm of machine vision similar to LLMs in text processing.
– **Core Innovations**:
– **Zero-shot Learning**: Veo 3 showcases the ability to perform a range of tasks without direct training or adaptation, leveraging the paradigm established by LLMs.
– **Generative Objectives**: The model is built on large-scale training with a generative objective (text/video continuation) on web-scale data – the same primitives that enabled zero-shot learning in NLP.
– **Performance Metrics**: An analysis of 18,384 generated videos across 62 qualitative and 7 quantitative tasks shows that while task-specific models still outperform Veo 3, there is a substantial and consistent improvement from Veo 2 to Veo 3 (a sketch of this style of evaluation follows this list).
– **Terminology**:
– The novel term “chain-of-frames” describes the model’s capacity to reason about visual content by integrating perception, modeling, and manipulation frame by frame, analogous to chain-of-thought reasoning in LLMs.
– **Insights and Future Directions**:
– The ability of Veo 3 to solve complex visual reasoning problems could revolutionize how machine vision is approached, enabling a wide array of applications from simple video tasks to more intricate reasoning challenges.
– The analogy to LLMs implies a paradigm shift for video processing, suggesting that generative video models may soon become as foundational for vision tasks as LLMs are for language tasks.
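As a concrete illustration of the quantitative comparison mentioned above, a harness like the hypothetical sketch below would sample several videos per task and report a pass@k-style success rate per model; the paper's exact protocol and task checkers are in the paper itself.

```python
def success_rate(model, tasks, checker, k: int = 10) -> float:
    """Fraction of tasks solved by at least one of k sampled videos (pass@k).
    `checker(task, final_frame) -> bool` decides whether a task was solved;
    `model.generate(prompt, image)` is a hypothetical sampling call that
    returns a list of frames."""
    solved = 0
    for task in tasks:
        attempts = (model.generate(task.prompt, task.image) for _ in range(k))
        if any(checker(task, frames[-1]) for frames in attempts):
            solved += 1
    return solved / len(tasks)

# Running the same harness over Veo 2 and Veo 3 is how you would surface the
# generation-over-generation improvement the paper reports.
```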
With its promising capabilities and the chain-of-frames framing it introduces, the paper underscores advances in AI that could have profound implications for intelligent systems across domains, including security and compliance scenarios where automated video analysis may become crucial.