I can run models across 4 GPUs on a single box (tensor parallel on a single node works).
However, I can't run a model across 8 GPUs spanning the 2 boxes.
Both machines are identical; each has 4x L40S.
NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0
Ray is active, and ray status shows both nodes with all 8 GPUs free:
Active:
1 node_19387efbfabdfed5c239cae361a26268060e57a4827e00db347acf59
1 node_f651c250524a7ac53150880cef521b96a5a58261b7f79bec3b2a1702
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
Total Usage:
0.0/256.0 CPU
0.0/8.0 GPU
0B/1.83TiB memory
0B/372.53GiB object_store_memory
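For reference, the cluster was brought up with the standard ray start commands, roughly like this (ports and exact flags from memory, so treat it as a sketch; 10.1.90.251 is the head node and 10.1.90.180 the second node, per the logs below):

# on the head node (10.1.90.251)
ray start --head --port=6379

# on the second node (10.1.90.180)
ray start --address=10.1.90.251:6379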
I thought testing with a smaller model spanning both nodes would make for an easier test (get it working on a single node first, then debug across the 2), but that failed with NCCL issues at every step.
My only "near success" was with RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8: it did hit the other server's GPUs (visible in nvidia-smi) before crashing out.
Any suggestions on how to sort out the NCCL issues? Any suggestions for a smaller model to try?
Ubuntu 22.04
vLLM 0.11.0
torch 2.8.0+cu128
NCCL 2.27.3
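Based on the error in the log below (and its suggestion to run with NCCL_DEBUG=INFO), the next thing I plan to try is turning on NCCL debug output and pinning the network interface explicitly on both nodes. This is only a sketch: eno1 is a placeholder for whichever interface carries the 10.1.90.x network (per ip addr), and NCCL_IB_DISABLE=1 only makes sense if the boxes are Ethernet-only.

# set on BOTH nodes, before ray start and before vllm serve, since I am
# not sure env vars set only on the head propagate to the remote workers
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_SOCKET_IFNAME=eno1   # placeholder: the NIC carrying 10.1.90.x
export GLOO_SOCKET_IFNAME=eno1
export NCCL_IB_DISABLE=1         # only if there is no InfiniBand

vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8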
I tried varying flags: --tensor-parallel-size, --enable-expert-parallel, --pipeline-parallel-size, and others; the full list of attempts is at the end of this post.
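As I understand the vLLM distributed-serving docs, the usual layout for two 4-GPU boxes is tensor parallel within a node and pipeline parallel across nodes, with the Ray backend selected explicitly via --distributed-executor-backend. A sketch of what I mean (I have not gotten this exact form working yet):

vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray

Back-of-envelope, the 480B model at FP8 is roughly 480 GB of weights, which would not fit in 8 x 48 GB anyway, so a smaller model is probably the right first target regardless of the NCCL problem.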
Here is the run that got closest (logs simplified):
(ray-vllm) (base) aiteam@oppenheimer:~/ray-vllm$ vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8
INFO 11-04 23:27:15 [init.py:216] Automatically detected platform cuda.
(APIServer pid=31729) INFO 11-04 23:27:18 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=31729) INFO 11-04 23:27:18 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8', 'model': 'RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8', 'quantization': 'fp8', 'tensor_parallel_size': 8}
(APIServer pid=31729) INFO 11-04 23:27:19 [model.py:547] Resolved architecture: Qwen3MoeForCausalLM
(APIServer pid=31729) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=31729) INFO 11-04 23:27:19 [model.py:1510] Using max model len 262144
(APIServer pid=31729) INFO 11-04 23:27:19 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 11-04 23:27:22 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [core.py:77] Initializing a V1 LLM engine…
(EngineCore_DP0 pid=31866) 2025-11-04 23:27:25,814 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.1.90.251:6379...
(EngineCore_DP0 pid=31866) 2025-11-04 23:27:25,839 INFO worker.py:2012 -- Connected to Ray cluster.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [ray_utils.py:345] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=31866) WARNING 11-04 23:27:26 [ray_utils.py:206] tensor_parallel_size=8 is bigger than a reserved number of GPUs (4 GPUs) in a node f651c250524a7ac53150880cef521b96a5a58261b7f79bec3b2a1702. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 8 GPUs available at each node.
(EngineCore_DP0 pid=31866) WARNING 11-04 23:27:26 [ray_utils.py:206] tensor_parallel_size=8 is bigger than a reserved number of GPUs (4 GPUs) in a node 19387efbfabdfed5c239cae361a26268060e57a4827e00db347acf59. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 8 GPUs available at each node.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:26 [ray_distributed_executor.py:171] use_ray_spmd_worker: True
(EngineCore_DP0 pid=31866) (pid=6142) INFO 11-04 23:27:28 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:32 [ray_env.py:65] Copying the following environment variables to workers: ['VLLM_USE_V1', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) [Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) INFO 11-04 23:27:35 [init.py:1384] Found nccl from library libnccl.so.2
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) INFO 11-04 23:27:35 [pynccl.py:103] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=31866) (pid=15136, ip=10.1.90.180) INFO 11-04 23:27:29 [init.py:216] Automatically detected platform cuda. [repeated 7x across cluster]
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
…
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 505, in _run_workers
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] ray_worker_outputs = ray.get(ray_worker_outputs)
…
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=15134, ip=10.1.90.180, …)
…
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 291, in NCCL_CHECK
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(EngineCore_DP0 pid=31866) Process EngineCore_DP0:
(EngineCore_DP0 pid=31866) Traceback (most recent call last):
…
(EngineCore_DP0 pid=31866) RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:36 [ray_distributed_executor.py:122] Shutting down Ray distributed executor…
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] Traceback (most recent call last):
…
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) [Gloo] Rank 6 is connected to 7 peer ranks… [repeated 14x across cluster]
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) INFO 11-04 23:27:35 [init.py:1384] Found nccl from library libnccl.so.2 [repeated 7x across cluster]
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) INFO 11-04 23:27:35 [pynccl.py:103] vLLM is using nccl==2.27.3 [repeated 7x across cluster]
(APIServer pid=31729) Traceback (most recent call last):
(APIServer pid=31729) File "/home/aiteam/ray-vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=31729) sys.exit(main())
…
(APIServer pid=31729) File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=31729) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=31729) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Other attempts:
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --gpu-memory-utilization 0.85
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --max-model-len 32768
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --gpu-memory-utilization 0.85
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8 --gpu-memory-utilization 0.85 --max-model-len 32768 --enforce-eager
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8
vllm serve RedHatAI/DeepSeek-Coder-V2-Instruct-FP8 --tensor-parallel-size 4 --enable-expert-parallel
vllm serve RedHatAI/DeepSeek-Coder-V2-Instruct-FP8 --tensor-parallel-size 8 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 4
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 8
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 8 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel --quantization fp
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel --quantization fp8
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4 --pipeline_parallel_size 2
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4 --pipeline_parallel_size 2 --enable-expert-parallel
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8 --enable-expert-parallel
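One more thing I am considering, to take vLLM and Ray out of the picture entirely, is running NVIDIA's nccl-tests across the two boxes. Sketch only: it assumes OpenMPI and passwordless SSH between the nodes, nccl-tests built with MPI support, and the MPI_HOME path and interface name are guesses for my setup.

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi   # path is a guess for Ubuntu 22.04

# 8 ranks total, 4 per host, 1 GPU per rank
mpirun -np 8 -H 10.1.90.251:4,10.1.90.180:4 \
  -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eno1 \
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

If that also fails with "unhandled system error", the problem is in the network/NCCL layer rather than anything vLLM or Ray are doing.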