Hey everyone,
I’m running into an issue with vLLM 10.2 in offline LLM chat mode using LLaMA 4 Scout on a multi-GPU setup:
- GPUs: 8 × NVIDIA H100
- VRAM capacity: ~15M tokens
- Model context size: 10M tokens
- OS / CUDA: Ubuntu 24.04, CUDA 12.9, NVIDIA Driver 580
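
For reference, the offline setup looks roughly like the sketch below; the model ID, max_model_len, and sampling parameters are illustrative placeholders, not the exact values from my script:

```python
from vllm import LLM, SamplingParams

# Rough sketch of the setup (placeholder model ID and lengths):
# Llama 4 Scout served offline with tensor parallelism across the 8 H100s.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative HF model ID
    tensor_parallel_size=8,       # one rank per H100
    max_model_len=10_000_000,     # 10M-token context window
)

# A single chat turn carrying a multi-million-token prompt.
messages = [{"role": "user", "content": "<several million tokens of context> Summarize this."}]
outputs = llm.chat(messages, SamplingParams(max_tokens=1024))
print(outputs[0].outputs[0].text)
```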
Problem:
When I feed the model more than ~3M tokens, the process hangs silently. There are no errors in the logs even with full debugging enabled:
import os

# Set before importing vLLM / initializing CUDA so the tracing, logging,
# and NCCL settings actually take effect.
os.environ["VLLM_TRACE_FUNCTION"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"
os.environ["NCCL_DEBUG"] = "TRACE"
Profiling shows:
- Stuck at sched_yield / acquire_read in vllm/distributed/device_communication.py (see the stack-dump sketch after this list)
- Appears to be a distributed communication deadlock or token scheduling bottleneck
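
For anyone who wants to reproduce that observation, a simple way to see where a process is stuck is a faulthandler watchdog like the sketch below. The 600-second timeout is arbitrary, and note that with tensor parallelism each worker runs in its own process, so this only dumps the stacks of the process it is installed in:

```python
import faulthandler
import sys

# If this process is still running after 10 minutes, dump the traceback of
# every Python thread to stderr, and keep dumping at each interval.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)

# ... run llm.chat(...) here ...

# Cancel the watchdog once generation returns normally.
faulthandler.cancel_dump_traceback_later()
```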
Expectation:
The model should be able to handle inputs well beyond ~3M tokens, given that the GPUs have enough memory (~15M-token capacity vs. the 10M-token context window).
Has anyone seen this before? Any tips or workarounds for handling long-context multi-GPU offline inference with vLLM?
Thanks in advance for any insights!