vLLM hangs during worker initialization on Blackwell PCIe GPUs unless --disable-custom-all-reduce is used

Description
When deploying a large model with tensor parallelism on a multi-GPU server, vLLM hangs during worker initialization. The logs repeatedly show:

[v1/engine/utils.py:1000] Waiting for 1 local, 0 remote core engine proc(s) to start.

Eventually it reports:

No available shared memory broadcast block found in 60 seconds.
This typically happens when some processes are hanging or doing some time-consuming work.

Both worker processes are alive but appear to be busy-waiting. Using strace shows the workers repeatedly calling sched_yield().
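One way to quantify the busy-wait is to attach strace to a worker for a few seconds and count the sched_yield() calls. The sketch below assumes strace and GNU timeout are installed and that the current user is allowed to attach to the process. count_sched_yield and sample_worker are hypothetical helper names, not part of vLLM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count strace output lines on stdin that are sched_yield() calls.
count_sched_yield() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c 'sched_yield(' || true
}

# Hypothetical helper: attach to PID (and its threads) for SECS seconds.
# strace writes its trace to stderr, hence the 2>&1 before the pipe.
sample_worker() {
  local pid="$1" secs="${2:-5}"
  timeout "$secs" strace -f -p "$pid" 2>&1 | count_sched_yield
}

# Example (replace 12345 with the PID of a stuck worker):
#   sample_worker 12345 5
```

A high count with few other syscalls in the trace is consistent with a spin loop rather than a blocked call.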

The issue disappears if --disable-custom-all-reduce is added.

Environment

OS: Ubuntu 24.04.4 LTS
Kernel: 6.17.0-14-generic
Python: 3.12.11
GPU: 8 × NVIDIA RTX PRO 6000 Blackwell Server Edition (96GB)
Driver: 580.126.09
CUDA runtime: 13.0
PyTorch: 2.10.0+cu128
Torch CUDA: 12.8
NCCL: 2.27.5
vLLM: 0.17.0

GPU Topology

GPU0-3: NUMA node 0
GPU4-7: NUMA node 1
Connection type: NODE / SYS (PCIe + CPU interconnect)
No NVLink
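
The link type for a specific GPU pair can be read from the matrix printed by nvidia-smi topo -m. In that tool's legend, NODE means the path stays within one NUMA node and SYS means it crosses the interconnect between NUMA nodes. The following is a minimal sketch, assuming the usual whitespace-separated layout in which each row starts with the GPU name followed by one cell per GPU. link_between is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvidia-smi topo -m` output on stdin and print the
# matrix cell for GPU A (row) and GPU B (column), e.g. NODE, SYS, PIX or NV4.
link_between() {
  local row="GPU$1" col=$(( $2 + 2 ))
  # The second condition skips the header line, whose first field can also be "GPU0".
  awk -v row="$row" -v col="$col" \
    '$1 == row && $2 !~ /^(GPU|NIC|CPU)/ { print $col; exit }'
}

# Only query the driver tools when they are actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi topo -m | link_between 4 5
fi
```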

Command

vllm serve /Data/models/Qwen/Qwen3-Next-80B-A3B-Instruct \
  --served-model-name Qwen3-Next-80B \
  --port 8000 \
  --tensor-parallel-size 2

Observed Behavior

The engine never finishes initializing workers. Logs repeatedly show:

Waiting for 1 local core engine proc(s) to start

Worker processes remain alive but stuck in a busy loop (sched_yield()).

Workaround

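The server starts once custom all-reduce is disabled. Note that the command below also restricts the run to GPUs 4 and 5 (both on NUMA node 1) and sets NCCL_P2P_DISABLE=1: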

CUDA_VISIBLE_DEVICES=4,5 \
NCCL_P2P_DISABLE=1 \
vllm serve ... \
--tensor-parallel-size 2 \
--disable-custom-all-reduce

With this option, the server starts normally.

Additional Notes

  • Without NCCL_P2P_DISABLE=1, the process previously hung during NCCL initialization.

  • The GPUs are PCIe-only (no NVLink) and split across two NUMA nodes.

  • This might be related to custom all-reduce topology assumptions on multi-NUMA PCIe systems.

Question

Is this a known issue with custom all-reduce on multi-NUMA PCIe systems (especially Blackwell GPUs)?
Should custom all-reduce be automatically disabled in such environments?

Yes, this is a known issue: vLLM’s custom all-reduce can hang on PCIe-only, multi-NUMA systems (especially with Blackwell GPUs) due to P2P or shared-memory assumptions that don’t hold across NUMA nodes. The two settings in your workaround act at different layers. --disable-custom-all-reduce turns off vLLM’s own all-reduce kernel so that all-reduce goes through NCCL, which is more robust on such topologies. NCCL_P2P_DISABLE=1 is an NCCL setting that disables NCCL’s peer-to-peer transport; it does not turn off vLLM’s custom all-reduce, and per your notes it was needed separately to get past a hang during NCCL initialization. This behavior has been reported by multiple users and is not unique to your setup.

Custom all-reduce is not always automatically disabled in these environments, but there are ongoing discussions and PRs to improve detection and fallback logic for PCIe/NUMA topologies. For now, manual disabling is required if you encounter hangs during initialization on multi-NUMA PCIe systems without NVLink, especially with Blackwell or similar GPUs. See details in this issue and related discussions.
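
Until such detection is available, a launch script can add the flag conservatively. The sketch below takes the link type reported by nvidia-smi topo -m for the GPU pair in use (for example NODE, SYS or NV4) and emits --disable-custom-all-reduce for anything that is not NVLink. The helper name custom_ar_flag, the LINK variable, the /path/to/model placeholder and the policy itself are assumptions for illustration, not vLLM's own detection logic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the extra vLLM flag for a given link type.
custom_ar_flag() {
  case "$1" in
    NV*) ;;                                    # NVLink: leave custom all-reduce enabled
    *)   echo "--disable-custom-all-reduce" ;; # PCIe-only or unknown: fall back to NCCL
  esac
}

# LINK is assumed to hold the link type for the GPU pair; defaults to SYS here.
LINK="${LINK:-SYS}"
args=(--tensor-parallel-size 2)
flag="$(custom_ar_flag "$LINK")"
if [ -n "$flag" ]; then
  args+=("$flag")
fi

# Print the resulting command instead of running it.
echo vllm serve /path/to/model "${args[@]}"
```

Printing the command first makes it easy to confirm which flags would be passed before starting the server.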

Would you like a step-by-step explanation of why this happens and how to diagnose it?
