Shared memory broadcast block not found: "No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work"

My setup is two H100 PCIe cards (no NVLink). When I run vLLM 0.15 across both GPUs inside a Python virtual environment, I hit this error, so I experimented with a small 1.5B model. In single-GPU mode it runs and serves normally (tested on GPU 0 and GPU 1 separately), but as soon as I add --tensor-parallel-size 2 the server throws this error and hangs with no output. I have already raised shared memory to 52 GB (the machine only has 64 GB of RAM total), but that seems to have no effect. An AI assistant had me change many NCCL- and SHM-related environment variables, which also didn't help. Is there a reasonable solution?

Here is some information that might be relevant:

OS: Kubuntu 24.04

Launch command: CUDA_VISIBLE_DEVICES=0,1 vllm serve /home/user/llm/Qwen2.5-1.5B-Instruct --served-model-name Qwen2.5-1.5B-Instruct --dtype auto --api-key token-abc123 --tensor-parallel-size 2 --gpu-memory-utilization 0.2

Shared memory info: df -h /dev/shm
umount: /dev/shm: target is busy.
Filesystem Size Used Avail Use% Mounted on
tmpfs 52G 0 52G 0% /dev/shm

GPU interconnect status:

(user) user@st650:~$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE 0-31,64-95 0 N/A
GPU1 NODE X 0-31,64-95 0 N/A

If you need any other information, please let me know.

Based on your description: dual H100 (PCIe, no NVLink) on vLLM 0.15, single-GPU inference works, but the server hangs once --tensor-parallel-size 2 is added, and enlarging shared memory made no difference. This is very common in multi-GPU setups without NVLink. The usual cause is that NCCL tries peer-to-peer (P2P) communication by default, but P2P support between PCIe-attached cards is limited or absent, so NCCL hangs during initialization or communication.

Workaround:
Set the environment variable export NCCL_P2P_DISABLE=1 before launching. This forces NCCL to skip P2P and route data through host memory instead. Several users have confirmed this fix; see the related issue discussions, the official documentation, and debugging guides.
Example commands:

export NCCL_P2P_DISABLE=1
CUDA_VISIBLE_DEVICES=0,1 vllm serve /home/user/llm/Qwen2.5-1.5B-Instruct --served-model-name Qwen2.5-1.5B-Instruct --dtype auto --api-key token-abc123 --tensor-parallel-size 2 --gpu-memory-utilization 0.2
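If you want to confirm the setting actually takes effect, NCCL's standard debug logging prints which transport is chosen at startup. A minimal sketch, using NCCL's documented debug variables, to add alongside the export above:

```shell
# INFO-level NCCL logs print the transport selected for each channel
# (P2P, SHM, or socket) during initialization; with P2P disabled you
# should see SHM/socket transports instead of P2P.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```

The log lines appear on stderr when the vllm serve process starts its tensor-parallel workers.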

If the problem persists, check your driver and NCCL versions, or try upgrading the driver and NCCL libraries. Would you like more detailed troubleshooting steps?
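As a quick check before upgrading anything, the commands below (a sketch, assuming PyTorch is installed in the same virtual environment as vLLM) report the driver version, the NCCL version PyTorch was built with, and whether CUDA itself believes the two cards can reach each other via P2P:

```shell
# Driver version as reported by the NVIDIA driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# NCCL version bundled with PyTorch, and CUDA's P2P capability report
# between GPU 0 and GPU 1 (False here would explain the hang)
python -c "import torch; print('NCCL', torch.cuda.nccl.version()); print('P2P 0->1:', torch.cuda.can_device_access_peer(0, 1))"
```

If can_device_access_peer reports False, the NCCL_P2P_DISABLE=1 workaround is the expected fix for this topology.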
