In normal cases, Qwen3.5 models (e.g., Qwen3.5-397B) have a native context length of 262,144 tokens. But the Qwen/Qwen3.5-397B-A17B page on Hugging Face also says it can be extended to at most 1,010,000 tokens. How can I do this with vLLM?
To extend Qwen3.5's context length beyond its native 262,144 tokens (e.g., up to 1,010,000 tokens), you need RoPE scaling, specifically the YaRN method, applied by overriding the model's rope parameters via the `--hf-overrides` argument when launching vLLM. Set the `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` environment variable, and provide the desired `rope_parameters` and `--max-model-len` in your serve command, for example:
```shell
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve Qwen/Qwen3.5-397B-A17B \
  --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11,11,10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' \
  --max-model-len 1010000
```
This will enable the model to process up to ~1M tokens per request using vLLM. For more details, see the official Qwen3.5 vLLM recipe.
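If it helps, the same override payload can be assembled in Python and printed for the CLI flag. This is just a sketch using the standard library; the key names and values mirror the serve command above, and the final check confirms that the YaRN factor actually covers the requested window:

```python
import json

# Build the --hf-overrides payload programmatically instead of hand-writing
# the JSON string. Values mirror the vllm serve command above.
rope_parameters = {
    "mrope_interleaved": True,
    "mrope_section": [11, 11, 10],
    "rope_type": "yarn",
    "rope_theta": 10000000,
    "partial_rotary_factor": 0.25,
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}
hf_overrides = {"text_config": {"rope_parameters": rope_parameters}}

# Serialize for the CLI: pass this string to --hf-overrides '...'
print(json.dumps(hf_overrides))

# Sanity check: the YaRN-scaled window (factor * native length) must cover
# the requested --max-model-len of 1,010,000 (262,144 * 4.0 = 1,048,576).
scaled = rope_parameters["factor"] * rope_parameters["original_max_position_embeddings"]
assert scaled >= 1_010_000
```

Printing the JSON and pasting it into the command avoids quoting mistakes in the shell, and the assertion catches a `factor` too small for the target `--max-model-len`.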
Would you like a step-by-step explanation or more details on the parameters?
Sources:
- https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html#__span-13-31
- https://github.com/vllm-project/vllm/blob/main/docs/features/context_extension.md
Hint: Mention @RunLLM in the post for followups.
Good! But can you explain the meaning of each key and value in the `--hf-overrides` JSON? @RunLLM