Yes, this is a known issue: with a larger max_model_len and a higher top_k (more retrieved docs/chunks), vLLM can become extremely slow or appear to hang, especially in multi-GPU tensor parallel (TP) setups, even when GPU memory is not fully used. Longer prompts drive up per-request KV cache and activation memory, and vLLM's scheduler can end up admitting only a few long-context sequences per batch in TP mode, so throughput collapses. The logs you see (very low throughput, low KV cache usage) are typical symptoms of this bottleneck. Reducing max_model_len or top_k alleviates the problem because it lowers both the memory and the scheduling pressure on the engine. See the discussion in vLLM GitHub issues #10102 and #13074.
To improve stability and throughput for long-context, multi-GPU inference:

1. Lower --max-num-seqs to cap the number of concurrent sequences and reduce memory pressure.
2. Tune --max-num-batched-tokens (try 2048–8192) for better batching.
3. Use FP8 KV cache quantization if your hardware and model support it (--kv-cache-dtype fp8) to roughly double cache capacity (Quantized KV Cache docs).
4. Don't count on CPU offload: partial KV cache offload is not fully supported, and vLLM still needs enough GPU memory to hold at least one full-length sequence, so it is not a complete workaround (see this discussion).

For best results, keep max_model_len as low as your use case allows, and monitor batch size and token counts. A rough example command is sketched below; would you like a step-by-step tuning checklist as well?
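As a starting point only, a command combining these settings could look like the sketch below. It assumes the vllm serve entrypoint from a recent vLLM release, two GPUs, and a placeholder model name; every number is a guess to tune against your own prompt lengths and traffic, and fp8 KV cache only applies if your hardware and model support it.

```
# Hypothetical example, not a definitive config: 2-GPU tensor parallel
# serving for a long-context RAG workload. Replace the model name and
# adjust every limit to match your setup.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90
```

The intent is to keep max_model_len close to the longest prompt your top_k chunks actually produce, and then let max_num_seqs and max_num_batched_tokens control how much concurrent long-context work the scheduler admits per step; chunked prefill is enabled here so a batched-token budget smaller than max_model_len is still valid.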