Source URL: https://simonwillison.net/2025/Jan/26/qwen25-1m/
Source: Simon Willison’s Weblog
Title: Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens
Feedly Summary: Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens
Very significant new release from Alibaba’s Qwen team. Their openly licensed (sometimes Apache 2, sometimes Qwen license, I’ve had trouble keeping up) Qwen 2.5 LLM previously had an input token limit of 128,000 tokens. This new model increases that to 1 million, using a new technique called Dual Chunk Attention, first described in this paper from February 2024.
They’ve released two models on Hugging Face: Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, both requiring CUDA and both under an Apache 2.0 license.
You’ll need a lot of VRAM to run them at their full capacity:
VRAM Requirement for processing 1 million-token sequences:
Qwen2.5-7B-Instruct-1M: At least 120GB VRAM (total across GPUs).
Qwen2.5-14B-Instruct-1M: At least 320GB VRAM (total across GPUs).
If your GPUs do not have sufficient VRAM, you can still use Qwen2.5-1M models for shorter tasks.
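Those numbers are roughly what you'd expect from a KV cache back-of-envelope. Here's a rough sketch for the 7B model, assuming a bf16 cache and Qwen2.5-7B's configuration of 28 layers and 4 KV heads of dimension 128 (numbers from the model's config, not from this announcement):
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (bf16)
echo $((2 * 28 * 4 * 128 * 2))           # 57344 bytes, about 56KB per token
echo $((2 * 28 * 4 * 128 * 2 * 1000000)) # about 57GB for a 1 million token sequence
The bf16 weights add roughly another 15GB, with the rest of that 120GB presumably going to activations and serving overhead.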
Qwen recommend using their custom fork of vLLM to serve the models:
You can also use the previous framework that supports Qwen2.5 for inference, but accuracy degradation may occur for sequences exceeding 262,144 tokens.
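For reference, launching the 7B model with stock vLLM looks something like this - treat the model ID prefix and the flag values as placeholder assumptions rather than Qwen's official recipe, since their fork may add its own options for Dual Chunk Attention:
# Sketch only: standard vLLM flags, not the fork-specific instructions
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
  --tensor-parallel-size 4 \
  --max-model-len 1010000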
GGUF quantized versions of the models are already starting to show up. LM Studio’s “official model curator” Bartowski published lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF and lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF – sizes range from 4.09GB to 8.1GB for the 7B model and 7.92GB to 15.7GB for the 14B.
These might not work well yet with the full context lengths as the underlying llama.cpp library may need some changes.
I tried running the 8.1GB 7B model using Ollama on my Mac like this:
ollama run hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
Then with LLM:
llm install llm-ollama
llm models -q qwen # To search for the model ID
# I set a shorter q1m alias:
llm aliases set q1m hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
I tried piping a large prompt in like this:
files-to-prompt ~/Dropbox/Development/llm -e py -c | llm -m q1m 'describe this codebase in detail'
That should give me every Python file in my llm project. Piping that through ttok first told me this was 63,014 OpenAI tokens; I expect the count for Qwen’s tokenizer is similar.
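The token count came from running the same pipeline through ttok instead, which counts using OpenAI’s tokenizer by default:
files-to-prompt ~/Dropbox/Development/llm -e py -c | ttok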
The result of that llm prompt was disappointing: it appeared to describe just the last Python file in the stream. Then I noticed the token usage report:
2,048 input, 999 output
This suggests to me that something’s not working right here – maybe the Ollama hosting framework is truncating the input, or maybe there’s a problem with the model?
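One likely culprit is Ollama’s default context window, which is 2,048 tokens and silently truncates anything longer. A possible workaround - a sketch, not something confirmed to fix this - is to bake a larger num_ctx into a custom Modelfile:
# Sketch: raise the context window for the already-pulled GGUF model
cat > Modelfile <<'EOF'
FROM hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
PARAMETER num_ctx 65536
EOF
ollama create qwen-1m-65k -f Modelfile
If running the prompt against that new qwen-1m-65k model reports the full ~63,000 input tokens, truncation was the problem; if not, something else is going on.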
I’ll update this post when I figure out how to run longer prompts through the new Qwen model using GGUF weights on a Mac.
Via VB
Tags: llms, ai, qwen, generative-ai
AI Summary and Description: Yes
Summary: The text discusses a significant advancement in Alibaba’s Qwen 2.5 LLM, which now supports an input context of up to 1 million tokens. This enables processing of much larger inputs and highlights innovations in model architecture relevant to professionals focused on AI and infrastructure security.
Detailed Description:
– The release of Qwen 2.5 introduces a notable increase in input token capacity—from 128,000 tokens to 1 million tokens—thanks to a new technique called Dual Chunk Attention.
– This development is crucial for practitioners in the AI domain, particularly those involved with large language models (LLMs), as it directly affects how models can handle, process, and analyze data.
– The newly released models are available on Hugging Face, requiring significant VRAM resources:
– **Qwen2.5-7B-Instruct-1M** requires at least 120GB VRAM.
– **Qwen2.5-14B-Instruct-1M** requires at least 320GB VRAM.
– Users with limited VRAM can still utilize the models for tasks that do not require the full input capacity.
– The recommended serving framework for the models is Qwen’s custom fork of vLLM; earlier frameworks that support Qwen2.5 still work, but accuracy may degrade for sequences exceeding 262,144 tokens.
– The emergence of GGUF quantized versions of these models points to more compact and efficient local usage, although the underlying llama.cpp library may need changes before the full context length works reliably.
– The author’s own experiment showed apparent input truncation (only 2,048 input tokens reported) when running a GGUF build through Ollama, illustrating the practical complexities of working with long-context models and the need for further optimization in hosting frameworks.
Key Insights for Professionals:
– The increased input token capability significantly enhances the potential applications for LLMs, especially in processing and analyzing large datasets efficiently, which is a critical area for security and compliance in organizations.
– Understanding the VRAM requirements and model serving options is essential for those involved in the deployment and scaling of AI solutions.
– There is value in monitoring user experiences and knowledge sharing within the community as models evolve, which can inform best practices and troubleshooting methodologies.
This development exemplifies how evolving technologies in AI can directly influence infrastructure requirements and underscore the importance of adaptive strategies in model deployment and scaling for security and compliance purposes.