NCCL error across 2 machines (2x4 GPUs), need advice

I can run models across 4 GPUs on a single box (tensor parallel on a single node works).
However, I can't run a model across all 8 GPUs spanning the 2 boxes.

Both machines are identical; each has 4x L40S.

NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0
Ray is active
Active:
1 node_19387efbfabdfed5c239cae361a26268060e57a4827e00db347acf59
1 node_f651c250524a7ac53150880cef521b96a5a58261b7f79bec3b2a1702
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Total Usage:
0.0/256.0 CPU
0.0/8.0 GPU
0B/1.83TiB memory
0B/372.53GiB object_store_memory

I thought that testing with a smaller model spanning both nodes would be an easier test (get it working on a single node first, then debug across the 2 nodes).
But that failed with NCCL issues at every step.

My only "near success" was with RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8.
It did hit the other server's GPUs (as seen via nvidia-smi) before crashing out.

Any suggestions on how to sort out NCCL issues? Any suggestions on a smaller model to try?

Ubuntu 22.04

vLLM 0.11.0

torch 2.8.0+cu128

NCCL 2.27.3

I tried with varying flags: --tensor-parallel-size, --enable-expert-parallel, --pipeline-parallel-size, and others.

Here is the attempt that got closest (simplified logs):

(ray-vllm) (base) aiteam@oppenheimer:~/ray-vllm$ vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8
INFO 11-04 23:27:15 [init.py:216] Automatically detected platform cuda.
(APIServer pid=31729) INFO 11-04 23:27:18 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=31729) INFO 11-04 23:27:18 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8', 'model': 'RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8', 'quantization': 'fp8', 'tensor_parallel_size': 8}
(APIServer pid=31729) INFO 11-04 23:27:19 [model.py:547] Resolved architecture: Qwen3MoeForCausalLM
(APIServer pid=31729) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=31729) INFO 11-04 23:27:19 [model.py:1510] Using max model len 262144
(APIServer pid=31729) INFO 11-04 23:27:19 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 11-04 23:27:22 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [core.py:77] Initializing a V1 LLM engine…
(EngineCore_DP0 pid=31866) 2025-11-04 23:27:25,814 INFO worker.py:1832 – Connecting to existing Ray cluster at address: 10.1.90.251:6379…
(EngineCore_DP0 pid=31866) 2025-11-04 23:27:25,839 INFO worker.py:2012 – Connected to Ray cluster.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [ray_utils.py:345] No current placement group found. Creating a new placement group.

(EngineCore_DP0 pid=31866) WARNING 11-04 23:27:26 [ray_utils.py:206] tensor_parallel_size=8 is bigger than a reserved number of GPUs (4 GPUs) in a node f651c250524a7ac53150880cef521b96a5a58261b7f79bec3b2a1702. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 8 GPUs available at each node.
(EngineCore_DP0 pid=31866) WARNING 11-04 23:27:26 [ray_utils.py:206] tensor_parallel_size=8 is bigger than a reserved number of GPUs (4 GPUs) in a node 19387efbfabdfed5c239cae361a26268060e57a4827e00db347acf59. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 8 GPUs available at each node.

(EngineCore_DP0 pid=31866) INFO 11-04 23:27:26 [ray_distributed_executor.py:171] use_ray_spmd_worker: True
(EngineCore_DP0 pid=31866) (pid=6142) INFO 11-04 23:27:28 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:32 [ray_env.py:65] Copying the following environment variables to workers: ['VLLM_USE_V1', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) [Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) INFO 11-04 23:27:35 [init.py:1384] Found nccl from library libnccl.so.2
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) INFO 11-04 23:27:35 [pynccl.py:103] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=31866) (pid=15136, ip=10.1.90.180) INFO 11-04 23:27:29 [init.py:216] Automatically detected platform cuda. [repeated 7x across cluster]
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)

(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 505, in _run_workers
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] ray_worker_outputs = ray.get(ray_worker_outputs)

(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=15134, ip=10.1.90.180, …)

(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 291, in NCCL_CHECK
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

(EngineCore_DP0 pid=31866) Process EngineCore_DP0:
(EngineCore_DP0 pid=31866) Traceback (most recent call last):

(EngineCore_DP0 pid=31866) RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:36 [ray_distributed_executor.py:122] Shutting down Ray distributed executor…
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] Traceback (most recent call last):

(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) [Gloo] Rank 6 is connected to 7 peer ranks… [repeated 14x across cluster]
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) INFO 11-04 23:27:35 [init.py:1384] Found nccl from library libnccl.so.2 [repeated 7x across cluster]
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) INFO 11-04 23:27:35 [pynccl.py:103] vLLM is using nccl==2.27.3 [repeated 7x across cluster]

(APIServer pid=31729) Traceback (most recent call last):
(APIServer pid=31729) File "/home/aiteam/ray-vllm/.venv/bin/vllm", line 10, in
(APIServer pid=31729) sys.exit(main())

(APIServer pid=31729) File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=31729) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=31729) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Other attempts:
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --gpu-memory-utilization 0.85
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --max-model-len 32768
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --gpu-memory-utilization 0.85
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8 --gpu-memory-utilization 0.85 --max-model-len 32768 --enforce-eager
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8
vllm serve RedHatAI/DeepSeek-Coder-V2-Instruct-FP8 --tensor-parallel-size 4 --enable-expert-parallel
vllm serve RedHatAI/DeepSeek-Coder-V2-Instruct-FP8 --tensor-parallel-size 8 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 4
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 8
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 8 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel --quantization fp
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel --quantization fp8
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8

vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4 --pipeline_parallel_size 2
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4 --pipeline_parallel_size 2 --enable-expert-parallel
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8 --enable-expert-parallel

Your NCCL errors when running vLLM across 2 nodes are likely due to network or NCCL misconfiguration. For multi-node tensor parallelism, you must ensure: (1) all nodes have the same environment (model path, Python, CUDA, NCCL, vLLM, etc.), (2) high-speed interconnect (ideally InfiniBand), and (3) correct Ray and NCCL environment variables. The error “NCCL error: unhandled system error” is often caused by network issues, missing NCCL env vars, or lack of GPUDirect RDMA. For L40S (no NVLink), pipeline parallelism is often more reliable than tensor parallelism across nodes, but both require careful setup.
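
For example, a minimal per-node environment setup might look like the sketch below. The interface name (eno1) is made up and the IPs come from the logs above; confirm both with ip addr on each box.

# Run on BOTH nodes before starting Ray and vLLM.
# eno1 and the IPs below are placeholders -- substitute your actual NIC and node addresses.
export NCCL_SOCKET_IFNAME=eno1     # NIC that carries the inter-node traffic
export GLOO_SOCKET_IFNAME=eno1     # keep the Gloo (CPU) process group on the same NIC
export VLLM_HOST_IP=10.1.90.251    # this node's own IP (10.1.90.180 on the second box)
export NCCL_DEBUG=INFO             # surfaces the real cause behind "unhandled system error"
export NCCL_IB_DISABLE=1           # only if there is no InfiniBand: stops NCCL from probing IB devices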

Recommended steps:

  • Use the run_cluster.sh helper to start Ray and vLLM containers, setting VLLM_HOST_IP and ensuring all nodes can communicate.
  • Set NCCL env vars for your network, e.g., NCCL_SOCKET_IFNAME=eth0 (or whatever your interface is called); for InfiniBand, also set NCCL_IB_HCA=mlx5 and run the container with --privileged in Docker (see the environment sketch above).
  • Run with NCCL_DEBUG=TRACE to get detailed logs, and check whether transfers show [send] via NET/Socket (bad) vs [send] via NET/IB/GDRDMA (good); a concrete example follows this list.
  • Download the model to the same path on all nodes.
  • Try a smaller model (e.g., Qwen2.5-7B) with --tensor-parallel-size 4 --pipeline-parallel-size 2 to test the multi-node setup (see the sketch after this list).
  • If you see “tensor_parallel_size=8 is bigger than a reserved number of GPUs (4 GPUs) in a node”, this means vLLM is spreading tensor parallel across nodes, which is only efficient with fast interconnects like InfiniBand. For L40S, pipeline parallelism is recommended (see docs).
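
A quick way to take vLLM and Ray out of the picture is a bare torch.distributed all-reduce across both boxes. This is a hypothetical sketch (not a vLLM-provided script): it assumes 10.1.90.251 is the head node, that port 29500 is reachable between the machines, and that the same Python environment exists on both. Run the identical command on both nodes.

cat > /tmp/nccl_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun supplies RANK / LOCAL_RANK / WORLD_SIZE / MASTER_ADDR for us
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # with 2 nodes x 4 GPUs, every rank should end up with 8.0
print(f"rank {dist.get_rank()}: all_reduce -> {x.item()}")
dist.destroy_process_group()
EOF

NCCL_DEBUG=INFO torchrun --nnodes=2 --nproc_per_node=4 \
  --rdzv_backend=c10d --rdzv_endpoint=10.1.90.251:29500 /tmp/nccl_check.py

If this hangs or fails with the same "unhandled system error", the problem is in the network/NCCL layer (firewall, wrong interface, mismatched hostnames) rather than in vLLM or Ray, and the NCCL lines printed just before the failure usually name it.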
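
For the smaller end-to-end test suggested above, something like the following keeps tensor parallelism inside each node and only crosses the node boundary with pipeline parallelism; the model name is just an example, and --enforce-eager is only there to simplify debugging:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
  vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --enforce-eager

Once that serves cleanly across both boxes, the same launch pattern (plus the memory and context-length limits already tried above) can be carried over to the 480B FP8 model.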

Would you like a step-by-step example command and troubleshooting checklist?
