Failed to run distributed inference with vLLM 0.8.2

The error you’re encountering occurs during NCCL (NVIDIA Collective Communications Library) initialization, which is required for communication between GPUs in distributed training and inference. The message “NCCL error: internal error” is generic; it suggests a problem with the NCCL setup. According to a vLLM GitHub issue, this can be caused by an incorrect network configuration or missing NCCL plugins.

To resolve this, ensure that the network interface named in NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME exists on every node and is reachable from the other nodes. Also verify that the NCCL version is compatible with your setup. You might also want to check whether the libnccl-net.so plugin is available and correctly loaded, since without it NCCL can fall back to a less efficient network plugin. For more detailed debugging, set NCCL_DEBUG=INFO to get additional logs that may help identify the root cause.
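As a rough illustration of the settings above, here is a minimal Python sketch that checks the chosen interface exists on the local machine and builds an environment with the three variables set. The helper name `build_nccl_env` and the interface name `eth0` are hypothetical, chosen only for this example; substitute the interface your nodes actually use. The sketch only prepares the environment and does not launch vLLM.

```python
import os
import socket


def build_nccl_env(ifname, debug="INFO", base_env=None, available=None):
    """Return a copy of the environment with NCCL/Gloo settings applied.

    `build_nccl_env` is a hypothetical helper for illustration, not part of
    vLLM or NCCL. `available` lets the caller supply the interface list;
    by default it is read from the local machine.
    """
    if available is None:
        # socket.if_nameindex() returns (index, name) pairs on Unix-like systems.
        available = [name for _, name in socket.if_nameindex()]
    if ifname not in available:
        raise ValueError(
            f"interface {ifname!r} not found; available: {sorted(available)}"
        )
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_SOCKET_IFNAME"] = ifname  # interface NCCL uses for socket traffic
    env["GLOO_SOCKET_IFNAME"] = ifname  # same interface for the Gloo backend
    env["NCCL_DEBUG"] = debug           # INFO prints NCCL initialization logs
    return env


if __name__ == "__main__":
    # "eth0" is an assumed interface name; check yours with `ip addr`.
    env = build_nccl_env("eth0")
    for key in ("NCCL_SOCKET_IFNAME", "GLOO_SOCKET_IFNAME", "NCCL_DEBUG"):
        print(f"{key}={env[key]}")
```

The resulting dictionary can be passed as `env=` to `subprocess.run` with whatever launch command you already use. The same values need to be set on every node, with the interface name adjusted per node if the names differ.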

Would you like more detailed troubleshooting steps or information on configuring NCCL for distributed setups?
