Simon Willison’s Weblog: TimeScope: How Long Can Your Video Large Multimodal Model Go?

Source URL: https://simonwillison.net/2025/Jul/23/timescope/#atom-everything
Source: Simon Willison’s Weblog
Title: TimeScope: How Long Can Your Video Large Multimodal Model Go?

Feedly Summary: TimeScope: How Long Can Your Video Large Multimodal Model Go?
New open source benchmark for evaluating vision LLMs on how well they handle long videos:

TimeScope probes the limits of long-video capabilities by inserting several short (~5-10 second) video clips—our “needles”—into base videos ranging from 1 minute to 8 hours. With three distinct task types, it evaluates not just retrieval but synthesis, localization, and fine-grained motion analysis, providing a more holistic view of temporal comprehension.

Videos can be fed into image-accepting models by converting them into thousands of individual frame images (a trick I’ve tried myself), so they were able to run the benchmark against models that included GPT 4.1, Qwen2.5-VL-7B and Llama-3.2 11B, in addition to video-supporting models like Gemini 2.5 Pro.
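
As an illustration of that frames trick (not the exact pipeline used by TimeScope or in the post), here is a minimal sketch that samples one frame per second from a video with OpenCV; the sampling rate, library choice, and function name are assumptions for the example:

```python
# Minimal sketch of the "video as frames" trick: sample a long video at a
# fixed rate so an image-accepting model can be prompted with the frames.
# Assumes OpenCV (pip install opencv-python); the 1-frame-per-second rate
# is an illustrative choice, not part of TimeScope itself.
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 1.0) -> list:
    """Return one decoded frame for roughly every `every_n_seconds` of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# e.g. an 8-hour video sampled at 1 frame per second yields ~28,800 frames
```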

Two discoveries from the benchmark that stood out to me:

Model size isn’t everything. Qwen 2.5-VL 3B and 7B, as well as InternVL 2.5 models at 2B, 4B, and 8B parameters, exhibit nearly indistinguishable long-video curves to their smaller counterparts. All of them plateau at roughly the same context length, showing that simply scaling parameters does not automatically grant a longer temporal horizon.
Gemini 2.5-Pro is in a league of its own. It is the only model that maintains strong accuracy on videos longer than one hour.

You can explore the benchmark dataset on Hugging Face, which includes prompts like this one:

Answer the question based on the given video. Only give me the answer and do not output any other words.
Question: What does the golden retriever do after getting out of the box?
A: lies on the ground
B: kisses the man
C: eats the food
D: follows the baby
E: plays with the ball
F: gets back into the box

Tags: ai, generative-ai, llms, vision-llms, evals

AI Summary and Description: Yes

Summary: The text discusses TimeScope, an open-source benchmark for evaluating how well vision Large Language Models (LLMs) process long-duration videos. The benchmark provides a more comprehensive assessment of temporal understanding and surfaces clear differences in model capabilities and limitations.

Detailed Description:

The TimeScope benchmark addresses the challenge of evaluating how vision LLMs handle long-form video content. The key points are:

– **Benchmark Framework**:
– TimeScope evaluates how well models manage long videos by embedding short (~5-10 second) “needle” clips into base videos ranging from 1 minute to 8 hours (see the sketch after this list).
– It includes various task types, focusing on retrieval, synthesis, localization, and fine-grained motion analysis, thereby covering a broad spectrum of temporal comprehension.
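
To make the needle-insertion idea concrete, here is a minimal sketch, assuming ffmpeg is on the PATH and that all clips share the same codec and resolution; it splices a short needle clip into a base video at a chosen timestamp. This illustrates the concept only and is not the actual TimeScope construction pipeline:

```python
# Illustrative needle-in-a-haystack construction: splice a short "needle" clip
# into a long base video at a given timestamp. Assumes ffmpeg is installed and
# that the inputs share codec/resolution (otherwise re-encode rather than
# stream-copy). Not the TimeScope implementation, just the idea.
import os
import subprocess
import tempfile

def insert_needle(base: str, needle: str, at_seconds: float, out: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        before = os.path.join(tmp, "before.mp4")
        after = os.path.join(tmp, "after.mp4")
        # Cut the base video into the parts before and after the insertion point.
        # Stream-copied cuts land on keyframes, so the split is approximate.
        subprocess.run(["ffmpeg", "-y", "-i", base, "-t", str(at_seconds),
                        "-c", "copy", before], check=True)
        subprocess.run(["ffmpeg", "-y", "-ss", str(at_seconds), "-i", base,
                        "-c", "copy", after], check=True)
        # Concatenate before + needle + after with ffmpeg's concat demuxer.
        listfile = os.path.join(tmp, "list.txt")
        with open(listfile, "w") as f:
            for clip in (before, needle, after):
                f.write(f"file '{clip}'\n")
        subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                        "-i", listfile, "-c", "copy", out], check=True)
```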

– **Adaptation of Videos for Models**:
– Videos are adapted for image-accepting models by converting them into sequences of thousands of frame images, making the benchmark compatible with a range of LLM architectures (a sketch of this step follows the list).
– It has been tested on prominent models including GPT 4.1, Qwen2.5-VL-7B, and Llama-3.2 11B, along with natively video-supporting models like Gemini 2.5 Pro.
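
A rough sketch of how sampled frames could be passed to an image-accepting model through an OpenAI-style chat API, using the openai Python client as one example of such an interface; the model name, question, and frame selection are illustrative and not part of the benchmark itself:

```python
# Prompt an image-accepting chat model with sampled video frames.
# The frames could come from any frame-sampling step (e.g. the OpenCV sketch
# above). Model name and question are placeholders for the example.
import base64
import cv2
from openai import OpenAI

def frame_to_data_url(frame) -> str:
    """Encode a decoded OpenCV frame as a base64 JPEG data URL."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    return "data:image/jpeg;base64," + base64.b64encode(jpeg.tobytes()).decode()

def ask_about_frames(frames, question: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{"type": "text", "text": question}]
    content += [{"type": "image_url", "image_url": {"url": frame_to_data_url(f)}}
                for f in frames]
    response = client.chat.completions.create(
        model="gpt-4.1",  # any image-accepting chat model
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```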

– **Key Discoveries**:
– **Model Size vs. Performance**: Increasing model size does not guarantee better long-video performance. Qwen 2.5-VL at 3B and 7B parameters and InternVL 2.5 at 2B, 4B, and 8B parameters show similar performance curves and plateau at roughly the same context length, which suggests that extending the temporal horizon requires more than additional parameters.
– **Gemini 2.5 Pro’s Superiority**: It is the only evaluated model that maintains strong accuracy on videos longer than one hour, setting it apart from every other model tested.

– **Practical Application**:
– The benchmark dataset is available on Hugging Face, where users can inspect the prompts and answer options paired with each video (a brief loading sketch follows).
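
As a small illustration, the prompts could be pulled with the Hugging Face datasets library; the repository id below is a placeholder, so check the dataset page linked from the post for the real name and split names:

```python
from datasets import load_dataset

# Placeholder repository id -- replace with the actual TimeScope dataset id
# shown on its Hugging Face page; the split name may also differ.
dataset = load_dataset("<timescope-dataset-id>")
print(dataset)             # available splits and columns
print(dataset["test"][0])  # one prompt/answer record, assuming a "test" split
```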

This analysis provides useful insight for AI professionals working on video processing and model evaluation, particularly for real-world applications where long-video comprehension is crucial.