Hey everyone,
I’m running into an issue with vLLM 10.2 in offline LLM chat mode using LLaMA 4 Scout on a multi-GPU setup:
- GPUs: 8 × NVIDIA H100
- VRAM capacity: ~15M tokens
- Model context size: 10M tokens
- OS / CUDA: Ubuntu 24.04, CUDA 12.9, NVIDIA Driver 580
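
For reference, the offline setup looks roughly like the sketch below; the model ID, max_model_len, and sampling parameters are illustrative placeholders, not the exact values from my script:

```python
from vllm import LLM, SamplingParams

# Rough sketch of the setup (placeholder model ID and lengths):
# Llama 4 Scout served offline with tensor parallelism across the 8 H100s.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative HF model ID
    tensor_parallel_size=8,       # one rank per H100
    max_model_len=10_000_000,     # 10M-token context window
)

# A single chat turn carrying a multi-million-token prompt.
messages = [{"role": "user", "content": "<several million tokens of context> Summarize this."}]
outputs = llm.chat(messages, SamplingParams(max_tokens=1024))
print(outputs[0].outputs[0].text)
```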
Problem:
When I feed the model more than ~3M tokens, the process hangs silently. There are no errors in the logs even with full debugging enabled:
import os

# Set before importing vLLM / initializing CUDA so the tracing, logging,
# and NCCL settings actually take effect.
os.environ["VLLM_TRACE_FUNCTION"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"
os.environ["NCCL_DEBUG"] = "TRACE"
Profiling shows:
- Stuck at sched_yield / acquire_read in vllm/distributed/device_communication.py (see the stack-dump sketch after this list)
- Appears to be a distributed communication deadlock or token scheduling bottleneck
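
For anyone who wants to reproduce that observation, a simple way to see where a process is stuck is a faulthandler watchdog like the sketch below. The 600-second timeout is arbitrary, and note that with tensor parallelism each worker runs in its own process, so this only dumps the stacks of the process it is installed in:

```python
import faulthandler
import sys

# If this process is still running after 10 minutes, dump the traceback of
# every Python thread to stderr, and keep dumping at each interval.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)

# ... run llm.chat(...) here ...

# Cancel the watchdog once generation returns normally.
faulthandler.cancel_dump_traceback_later()
```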
Expectation:
The model should be able to handle inputs well beyond ~3M tokens, given that the GPUs have enough memory (~15M-token capacity vs. the 10M-token context window).
Has anyone seen this before? Any tips or workarounds for handling long-context multi-GPU offline inference with vLLM?
Thanks in advance for any insights!