Description
When deploying a large model with tensor parallelism on a multi-GPU server, vLLM hangs during worker initialization. The logs repeatedly show:
[v1/engine/utils.py:1000] Waiting for 1 local, 0 remote core engine proc(s) to start.
Eventually it reports:
No available shared memory broadcast block found in 60 seconds.
This typically happens when some processes are hanging or doing some time-consuming work.
Both worker processes are alive but appear to be busy-waiting: attaching strace to them shows repeated calls to sched_yield().
The issue disappears if --disable-custom-all-reduce is added.
Environment
OS: Ubuntu 24.04.4 LTS
Kernel: 6.17.0-14-generic
Python: 3.12.11
GPU: 8 × NVIDIA RTX PRO 6000 Blackwell Server Edition (96GB)
Driver: 580.126.09
CUDA runtime: 13.0
PyTorch: 2.10.0+cu128
Torch CUDA: 12.8
NCCL: 2.27.5
vLLM: 0.17.0
GPU Topology
GPU0-3: NUMA node 0
GPU4-7: NUMA node 1
Connection type: NODE / SYS (PCIe + CPU interconnect)
No NVLink
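To make the topology check repeatable, here is a minimal sketch that parses the text of `nvidia-smi topo -m` and lists GPU pairs whose path goes through the host (NODE or SYS). The function names and the 4-GPU sample matrix are illustrative assumptions, not output from this machine, and the parser assumes the GPU columns come first in the header row.

```python
import re

# Link labels nvidia-smi uses for paths that traverse the host:
# NODE = across PCIe host bridges within one NUMA node, SYS = across NUMA nodes.
HOST_ROUTED = {"NODE", "SYS"}

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")  # some nvidia-smi versions underline the header
GPU_RE = re.compile(r"GPU(\d+)")


def parse_topo_matrix(text):
    """Return {(i, j): link} for the GPU-to-GPU cells of `nvidia-smi topo -m` text."""
    lines = [ln for ln in ANSI_RE.sub("", text).splitlines() if ln.strip()]
    # Column order is taken from the header; non-GPU columns (NIC, affinity) are ignored.
    cols = [int(m.group(1)) for tok in lines[0].split()
            if (m := GPU_RE.fullmatch(tok))]
    links = {}
    for ln in lines[1:]:
        cells = ln.split()
        row = GPU_RE.fullmatch(cells[0])
        if not row:
            continue  # skip NIC rows and the legend
        i = int(row.group(1))
        for j, link in zip(cols, cells[1:1 + len(cols)]):
            if i != j:
                links[(i, j)] = link
    return links


def host_routed_pairs(links, gpus):
    """Pairs (a, b), a < b, among `gpus` whose link is NODE or SYS."""
    return sorted((a, b) for a in gpus for b in gpus
                  if a < b and links.get((a, b)) in HOST_ROUTED)


# Illustrative sample only (hypothetical 4-GPU, 2-NUMA-node machine).
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NODE    SYS     SYS     0-31    0
GPU1    NODE     X      SYS     SYS     0-31    0
GPU2    SYS     SYS      X      NODE    32-63   1
GPU3    SYS     SYS     NODE     X      32-63   1
"""

if __name__ == "__main__":
    print(host_routed_pairs(parse_topo_matrix(SAMPLE), [2, 3]))
```

On a real machine the input would be the captured output of `nvidia-smi topo -m`, and the GPU list would be the devices selected for the tensor-parallel group.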
Command
vllm serve /Data/models/Qwen/Qwen3-Next-80B-A3B-Instruct \
--served-model-name Qwen3-Next-80B \
--port 8000 \
--tensor-parallel-size 2
Observed Behavior
The engine never finishes initializing workers. Logs repeatedly show:
Waiting for 1 local, 0 remote core engine proc(s) to start.
Worker processes remain alive but stuck in a busy loop (sched_yield()).
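To back up the busy-loop observation with numbers, the following sketch tallies syscalls in captured strace text and reports whether a single call dominates. The helper names, the 90% threshold, and the sample trace are assumptions for illustration; the regex only handles plain strace lines with an optional PID prefix, not timestamped output.

```python
import re
from collections import Counter

# Syscall name at the start of a line, optionally preceded by "[pid N] "
# (strace -f to the terminal) or "N " (strace -f writing to a file).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(")


def tally_syscalls(strace_text):
    """Count how often each syscall appears in strace output."""
    counts = Counter()
    for line in strace_text.splitlines():
        m = SYSCALL_RE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts


def dominant_syscall(counts, threshold=0.9):
    """Return the syscall making up at least `threshold` of all calls, else None."""
    total = sum(counts.values())
    if total == 0:
        return None
    name, n = counts.most_common(1)[0]
    return name if n / total >= threshold else None


# Illustrative trace only: 19 sched_yield calls and one futex call.
SAMPLE = "sched_yield() = 0\n" * 19 + "futex(0x7f0000000000, FUTEX_WAKE_PRIVATE, 1) = 0\n"

if __name__ == "__main__":
    counts = tally_syscalls(SAMPLE)
    print(counts.most_common(3), dominant_syscall(counts))
```

A trace dominated by sched_yield suggests the worker is spinning while waiting on a peer rather than blocked in a kernel wait, which fits a hang in a spin-based synchronization step.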
Workaround
The server starts when custom all-reduce is disabled. The working invocation also restricts the run to GPUs 4 and 5 and sets NCCL_P2P_DISABLE=1:
CUDA_VISIBLE_DEVICES=4,5 \
NCCL_P2P_DISABLE=1 \
vllm serve ... \
--tensor-parallel-size 2 \
--disable-custom-all-reduce
With this option, the server starts normally.
Additional Notes
- Without NCCL_P2P_DISABLE=1, the process previously hung during NCCL initialization.
- The GPUs are PCIe-only (no NVLink) and split across two NUMA nodes.
- This might be related to custom all-reduce topology assumptions on multi-NUMA PCIe systems.
Question
Is this a known issue with custom all-reduce on multi-NUMA PCIe systems (especially Blackwell GPUs)?
Should custom all-reduce be automatically disabled in such environments?