Yes, this is a known issue: with a larger max_model_len and a higher top_k (more retrieved docs/chunks), vLLM can become extremely slow or appear to hang, especially in multi-GPU tensor parallel (TP) setups, even when GPU memory is not fully used. Longer prompts drive up per-request KV cache and activation memory, and vLLM's scheduler can end up admitting only a few long-context sequences per batch in TP mode, so throughput collapses. The logs you see (very low throughput, low KV cache usage) are typical symptoms of this bottleneck. Reducing max_model_len or top_k alleviates the problem because it lowers both the memory and the scheduling pressure on the engine. See the discussion in vLLM GitHub issues #10102 and #13074.
To improve stability and throughput for long-context, multi-GPU inference:

1. Lower --max-num-seqs to cap the number of concurrent sequences and reduce memory pressure.
2. Tune --max-num-batched-tokens (try 2048–8192) for better batching.
3. Use FP8 KV cache quantization if your hardware and model support it (--kv-cache-dtype fp8) to roughly double cache capacity (Quantized KV Cache docs).
4. Don't count on CPU offload: partial KV cache offload is not fully supported, and vLLM still needs enough GPU memory to hold at least one full-length sequence, so it is not a complete workaround (see this discussion).

For best results, keep max_model_len as low as your use case allows, and monitor batch size and token counts. A rough example command is sketched below; would you like a step-by-step tuning checklist as well?
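As a starting point only, a command combining these settings could look like the sketch below. It assumes the vllm serve entrypoint from a recent vLLM release, two GPUs, and a placeholder model name; every number is a guess to tune against your own prompt lengths and traffic, and fp8 KV cache only applies if your hardware and model support it.

```
# Hypothetical example, not a definitive config: 2-GPU tensor parallel
# serving for a long-context RAG workload. Replace the model name and
# adjust every limit to match your setup.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90
```

The intent is to keep max_model_len close to the longest prompt your top_k chunks actually produce, and then let max_num_seqs and max_num_batched_tokens control how much concurrent long-context work the scheduler admits per step; chunked prefill is enabled here so a batched-token budget smaller than max_model_len is still valid.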