Source URL: https://simonwillison.net/2025/Jan/26/qwen25-1m/
Source: Simon Willison’s Weblog
Title: Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens
Feedly Summary: Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens
Very significant new release from Alibaba’s Qwen team. Their openly licensed (sometimes Apache 2, sometimes Qwen license, I’ve had trouble keeping up) Qwen 2.5 LLM previously had an input token limit of 128,000 tokens. This new model increases that to 1 million, using a new technique called Dual Chunk Attention, first described in this paper from February 2024.
They’ve released two models on Hugging Face: Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, both requiring CUDA and both under an Apache 2.0 license.
You’ll need a lot of VRAM to run them at their full capacity:
VRAM Requirement for processing 1 million-token sequences:
Qwen2.5-7B-Instruct-1M: At least 120GB VRAM (total across GPUs).
Qwen2.5-14B-Instruct-1M: At least 320GB VRAM (total across GPUs).
If your GPUs do not have sufficient VRAM, you can still use Qwen2.5-1M models for shorter tasks.
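Those numbers are roughly what you'd expect from a KV cache back-of-envelope. Here's a rough sketch for the 7B model, assuming a bf16 cache and Qwen2.5-7B's configuration of 28 layers and 4 KV heads of dimension 128 (numbers from the model's config, not from this announcement):
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (bf16)
echo $((2 * 28 * 4 * 128 * 2))           # 57344 bytes, about 56KB per token
echo $((2 * 28 * 4 * 128 * 2 * 1000000)) # about 57GB for a 1 million token sequence
The bf16 weights add roughly another 15GB, with the rest of that 120GB presumably going to activations and serving overhead.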
Qwen recommend using their custom fork of vLLM to serve the models:
You can also use the previous framework that supports Qwen2.5 for inference, but accuracy degradation may occur for sequences exceeding 262,144 tokens.
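For reference, launching the 7B model with stock vLLM looks something like this - treat the model ID prefix and the flag values as placeholder assumptions rather than Qwen's official recipe, since their fork may add its own options for Dual Chunk Attention:
# Sketch only: standard vLLM flags, not the fork-specific instructions
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
  --tensor-parallel-size 4 \
  --max-model-len 1010000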
GGUF quantized versions of the models are already starting to show up. LM Studio’s “official model curator” Bartowski published lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF and lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF – sizes range from 4.09GB to 8.1GB for the 7B model and 7.92GB to 15.7GB for the 14B.
These might not work well yet with the full context lengths as the underlying llama.cpp library may need some changes.
I tried running the 8.1GB 7B model using Ollama on my Mac like this:
ollama run hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
Then with LLM:
llm install llm-ollama
llm models -q qwen # To search for the model ID
# I set a shorter q1m alias:
llm aliases set q1m hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
I tried piping a large prompt in like this:
files-to-prompt ~/Dropbox/Development/llm -e py -c | llm -m q1m 'describe this codebase in detail'
That should give me every Python file in my llm project. Piping that through ttok first told me this was 63,014 OpenAI tokens; I expect the count for Qwen’s tokenizer is similar.
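The token count came from running the same pipeline through ttok instead, which counts using OpenAI’s tokenizer by default:
files-to-prompt ~/Dropbox/Development/llm -e py -c | ttok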
The result of that llm prompt was disappointing: it appeared to describe just the last Python file in the stream. Then I noticed the token usage report:
2,048 input, 999 output
This suggests to me that something’s not working right here – maybe the Ollama hosting framework is truncating the input, or maybe there’s a problem with the model?
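One likely culprit is Ollama’s default context window, which is 2,048 tokens and silently truncates anything longer. A possible workaround - a sketch, not something confirmed to fix this - is to bake a larger num_ctx into a custom Modelfile:
# Sketch: raise the context window for the already-pulled GGUF model
cat > Modelfile <<'EOF'
FROM hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
PARAMETER num_ctx 65536
EOF
ollama create qwen-1m-65k -f Modelfile
If running the prompt against that new qwen-1m-65k model reports the full ~63,000 input tokens, truncation was the problem; if not, something else is going on.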
I’ll update this post when I figure out how to run longer prompts through the new Qwen model using GGUF weights on a Mac.
Via VB
Tags: llms, ai, qwen, generative-ai
AI Summary and Description: Yes
Summary: The text discusses a significant advancement in Alibaba’s Qwen 2.5 LLM, which now supports an input context of up to 1 million tokens. This enables processing of much larger inputs and highlights innovations in model architecture relevant to professionals focused on AI and infrastructure security.
Detailed Description:
– The release of Qwen 2.5 introduces a notable increase in input token capacity—from 128,000 tokens to 1 million tokens—thanks to a new technique called Dual Chunk Attention.
– This development is crucial for practitioners in the AI domain, particularly those involved with large language models (LLMs), as it directly affects how models can handle, process, and analyze data.
– The newly released models are available on Hugging Face, requiring significant VRAM resources:
– **Qwen2.5-7B-Instruct-1M** requires at least 120GB VRAM.
– **Qwen2.5-14B-Instruct-1M** requires at least 320GB VRAM.
– Users with limited VRAM can still utilize the models for tasks that do not require the full input capacity.
– The recommended serving framework for the models is Qwen’s custom fork of vLLM; earlier frameworks that support Qwen2.5 still work, but accuracy may degrade for sequences exceeding 262,144 tokens.
– The emergence of GGUF quantized versions of these models points to more compact and efficient local usage, although the underlying llama.cpp library may need changes before the full context length works reliably.
– The author’s own experiment showed apparent input truncation (only 2,048 input tokens reported) when running a GGUF build through Ollama, illustrating the practical complexities of working with long-context models and the need for further optimization in hosting frameworks.
Key Insights for Professionals:
– The increased input token capability significantly enhances the potential applications for LLMs, especially in processing and analyzing large datasets efficiently, which is a critical area for security and compliance in organizations.
– Understanding the VRAM requirements and model serving options is essential for those involved in the deployment and scaling of AI solutions.
– There is value in monitoring user experiences and knowledge sharing within the community as models evolve, which can inform best practices and troubleshooting methodologies.
This development exemplifies how evolving technologies in AI can directly influence infrastructure requirements and underscore the importance of adaptive strategies in model deployment and scaling for security and compliance purposes.