vLLM hangs silently with LLaMA 4 Scout beyond 3M tokens – need help

Hey everyone,

I’m running into an issue with vLLM 0.10.2 running offline inference (LLM.chat) with LLaMA 4 Scout on a multi-GPU setup (a rough initialization sketch follows the specs):

  • GPUs: 8 × NVIDIA H100

  • KV-cache capacity (VRAM): ~15M tokens

  • Model context size: 10M tokens

  • OS / CUDA: Ubuntu 24.04, CUDA 12.9, NVIDIA Driver 580
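
For reference, the engine is brought up roughly like this (a minimal sketch, not the exact script; the model id, gpu_memory_utilization, sampling settings, and the synthetic long prompt are stand-ins):

from vllm import LLM, SamplingParams

# Rough shape of the offline setup; every concrete value below is a placeholder.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed HF model id
    tensor_parallel_size=8,        # one shard per H100
    max_model_len=10_000_000,      # 10M-token context window
    gpu_memory_utilization=0.95,   # placeholder
)

sampling = SamplingParams(temperature=0.0, max_tokens=512)

# Synthetic long input just to exercise the context length; the hang starts
# once the prompt grows past roughly 3M tokens.
long_text = "The quick brown fox jumps over the lazy dog. " * 400_000
outputs = llm.chat([{"role": "user", "content": long_text}], sampling)
print(outputs[0].outputs[0].text[:200])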

Problem:
When I feed the model more than ~3M tokens, the process hangs silently. There are no errors in the logs even with full debugging enabled:

import os

# Set these at the top of the script, before importing vllm, so the logging,
# tracing, and NCCL settings take effect in the engine and its worker processes.
os.environ["VLLM_TRACE_FUNCTION"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"
os.environ["NCCL_DEBUG"] = "TRACE"
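
If it helps anyone reproducing this: NCCL's standard NCCL_DEBUG_FILE variable can route each rank's trace to its own file instead of interleaving everything on stderr (a sketch; the path is just an example):

# %h expands to the hostname and %p to the process id, so each of the
# 8 ranks writes its NCCL TRACE output to a separate file.
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl-debug.%h.%p.log"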

Profiling shows:

  • Stuck at sched_yield / acquire_read in vllm/distributed/device_communicators/shm_broadcast.py

  • Looks like a distributed-communication deadlock or a token-scheduling bottleneck
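
For anyone trying to reproduce the profiling step: one way to grab those stacks from a hung run without killing it is Python's standard faulthandler module (a minimal sketch; the signal and timeout choices are arbitrary):

import faulthandler
import signal

# Dump every thread's stack to stderr when the process receives SIGUSR1,
# so a hang can be inspected with `kill -USR1 <pid>` from another shell.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Also dump stacks automatically if the script is still running after
# 10 minutes, and keep dumping every 10 minutes while the hang persists.
faulthandler.dump_traceback_later(600, repeat=True)

Note that this only covers the driver process; the tensor-parallel workers run as separate processes, so an external sampling profiler is still needed for their stacks.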

Expectation:
The model should handle inputs well beyond 3M tokens, since the context window is 10M tokens and the GPUs have KV-cache room for roughly 15M.

Has anyone seen this before? Any tips or workarounds for handling long-context multi-GPU offline inference with vLLM?

Thanks in advance for any insights!

This is a known issue: vLLM can hang silently when processing more than ~3M tokens with LLaMA 4 Scout in multi-GPU offline inference, even when GPU memory is sufficient. Profiling points to a distributed-communication deadlock, with the process stuck in sched_yield / acquire_read, and no errors are logged even with full debugging enabled. The available data sources document no workaround or fix for this specific long-context, multi-GPU deadlock, and the issue remains open for investigation by the vLLM team (see "vLLM silently hangs on LLaMA 4 Scout with >3M tokens despite sufficient GPU memory").

Would you like more detail on possible debugging steps or related distributed deadlock issues in vLLM?

Hint: Mention @RunLLM in the post for followups.