Simon Willison’s Weblog: TimeScope: How Long Can Your Video Large Multimodal Model Go?

Source URL: https://simonwillison.net/2025/Jul/23/timescope/#atom-everything
Source: Simon Willison’s Weblog
Title: TimeScope: How Long Can Your Video Large Multimodal Model Go?

Feedly Summary: TimeScope: How Long Can Your Video Large Multimodal Model Go?
New open source benchmark for evaluating vision LLMs on how well they handle long videos:

TimeScope probes the limits of long-video capabilities by inserting several short (~5-10 second) video clips—our “needles”—into base videos ranging from 1 minute to 8 hours. With three distinct task types, it evaluates not just retrieval but synthesis, localization, and fine-grained motion analysis, providing a more holistic view of temporal comprehension.

Videos can be fed into image-accepting models by converting them into thousands of individual frame images (a trick I’ve tried myself), so they were able to run the benchmark against models that included GPT 4.1, Qwen2.5-VL-7B and Llama-3.2 11B, in addition to video-supporting models like Gemini 2.5 Pro.
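
As an illustration of that frames trick (not the exact pipeline used by TimeScope or in the post), here is a minimal sketch that samples one frame per second from a video with OpenCV; the sampling rate, library choice, and function name are assumptions for the example:

```python
# Minimal sketch of the "video as frames" trick: sample a long video at a
# fixed rate so an image-accepting model can be prompted with the frames.
# Assumes OpenCV (pip install opencv-python); the 1-frame-per-second rate
# is an illustrative choice, not part of TimeScope itself.
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 1.0) -> list:
    """Return one decoded frame for roughly every `every_n_seconds` of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# e.g. an 8-hour video sampled at 1 frame per second yields ~28,800 frames
```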

Two discoveries from the benchmark that stood out to me:

Model size isn’t everything. Qwen 2.5-VL 3B and 7B, as well as InternVL 2.5 models at 2B, 4B, and 8B parameters, exhibit nearly indistinguishable long-video curves to their smaller counterparts. All of them plateau at roughly the same context length, showing that simply scaling parameters does not automatically grant a longer temporal horizon.
Gemini 2.5-Pro is in a league of its own. It is the only model that maintains strong accuracy on videos longer than one hour.

You can explore the benchmark dataset on Hugging Face, which includes prompts like this one:

Answer the question based on the given video. Only give me the answer and do not output any other words.
Question: What does the golden retriever do after getting out of the box?
A: lies on the ground
B: kisses the man
C: eats the food
D: follows the baby
E: plays with the ball
F: gets back into the box

Tags: ai, generative-ai, llms, vision-llms, evals

AI Summary and Description: Yes

Summary: The text discusses TimeScope, an open-source benchmark for evaluating how well vision Large Language Models (LLMs) process long-duration videos. The benchmark provides a more comprehensive assessment of temporal understanding and surfaces clear differences in model capabilities and limitations.

Detailed Description:

The TimeScope benchmark addresses the challenge of evaluating how vision LLMs handle long-form video content. The key points are:

– **Benchmark Framework**:
– TimeScope evaluates how well models manage long videos by embedding short (~5-10 second) “needle” clips into base videos ranging from 1 minute to 8 hours (see the sketch after this list).
– It includes various task types, focusing on retrieval, synthesis, localization, and fine-grained motion analysis, thereby covering a broad spectrum of temporal comprehension.
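
To make the needle-insertion idea concrete, here is a minimal sketch, assuming ffmpeg is on the PATH and that all clips share the same codec and resolution; it splices a short needle clip into a base video at a chosen timestamp. This illustrates the concept only and is not the actual TimeScope construction pipeline:

```python
# Illustrative needle-in-a-haystack construction: splice a short "needle" clip
# into a long base video at a given timestamp. Assumes ffmpeg is installed and
# that the inputs share codec/resolution (otherwise re-encode rather than
# stream-copy). Not the TimeScope implementation, just the idea.
import os
import subprocess
import tempfile

def insert_needle(base: str, needle: str, at_seconds: float, out: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        before = os.path.join(tmp, "before.mp4")
        after = os.path.join(tmp, "after.mp4")
        # Cut the base video into the parts before and after the insertion point.
        # Stream-copied cuts land on keyframes, so the split is approximate.
        subprocess.run(["ffmpeg", "-y", "-i", base, "-t", str(at_seconds),
                        "-c", "copy", before], check=True)
        subprocess.run(["ffmpeg", "-y", "-ss", str(at_seconds), "-i", base,
                        "-c", "copy", after], check=True)
        # Concatenate before + needle + after with ffmpeg's concat demuxer.
        listfile = os.path.join(tmp, "list.txt")
        with open(listfile, "w") as f:
            for clip in (before, needle, after):
                f.write(f"file '{clip}'\n")
        subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                        "-i", listfile, "-c", "copy", out], check=True)
```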

– **Adaptation of Videos for Models**:
– Videos are adapted for image-accepting models by converting them into sequences of thousands of frame images, making the benchmark compatible with a range of LLM architectures (a sketch of this step follows the list).
– It has been tested on prominent models including GPT 4.1, Qwen2.5-VL-7B, and Llama-3.2 11B, along with natively video-supporting models like Gemini 2.5 Pro.
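
A rough sketch of how sampled frames could be passed to an image-accepting model through an OpenAI-style chat API, using the openai Python client as one example of such an interface; the model name, question, and frame selection are illustrative and not part of the benchmark itself:

```python
# Prompt an image-accepting chat model with sampled video frames.
# The frames could come from any frame-sampling step (e.g. the OpenCV sketch
# above). Model name and question are placeholders for the example.
import base64
import cv2
from openai import OpenAI

def frame_to_data_url(frame) -> str:
    """Encode a decoded OpenCV frame as a base64 JPEG data URL."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    return "data:image/jpeg;base64," + base64.b64encode(jpeg.tobytes()).decode()

def ask_about_frames(frames, question: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{"type": "text", "text": question}]
    content += [{"type": "image_url", "image_url": {"url": frame_to_data_url(f)}}
                for f in frames]
    response = client.chat.completions.create(
        model="gpt-4.1",  # any image-accepting chat model
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```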

– **Key Discoveries**:
– **Model Size vs. Performance**: Increasing model size does not guarantee better long-video performance. Qwen 2.5-VL at 3B and 7B parameters and InternVL 2.5 at 2B, 4B, and 8B parameters show similar performance curves and plateau at roughly the same context length, which suggests that extending the temporal horizon requires more than additional parameters.
– **Gemini 2.5 Pro’s Superiority**: It is the only evaluated model that maintains strong accuracy on videos longer than one hour, setting it apart from every other model tested.

– **Practical Application**:
– The benchmark dataset is available on Hugging Face, where users can inspect the prompts and answer options paired with each video (a brief loading sketch follows).
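
As a small illustration, the prompts could be pulled with the Hugging Face datasets library; the repository id below is a placeholder, so check the dataset page linked from the post for the real name and split names:

```python
from datasets import load_dataset

# Placeholder repository id -- replace with the actual TimeScope dataset id
# shown on its Hugging Face page; the split name may also differ.
dataset = load_dataset("<timescope-dataset-id>")
print(dataset)             # available splits and columns
print(dataset["test"][0])  # one prompt/answer record, assuming a "test" split
```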

This analysis provides useful insight for AI professionals working on video processing and model evaluation, particularly for real-world applications where long-video comprehension is crucial.