NCCL error across 2 machines (2x4 GPUs), need advice

I can run models across 4 GPUs on a single box (tensor parallel on a single node works).
However, I can't run a model across all 8 GPUs spanning the 2 boxes.

Both machines are identical; each has 4x L40S.

NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0
Ray is running; ray status shows both nodes:
Active:
1 node_19387efbfabdfed5c239cae361a26268060e57a4827e00db347acf59
1 node_f651c250524a7ac53150880cef521b96a5a58261b7f79bec3b2a1702
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Total Usage:
0.0/256.0 CPU
0.0/8.0 GPU
0B/1.83TiB memory
0B/372.53GiB object_store_memory

I thought testing with a smaller model and spanning it across both nodes would be an easier test (get it working on a single node, then debug across 2),
but that failed with NCCL errors at every step.

My only "near success" was with RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8.
It did reach the other server's GPUs (visible via nvidia-smi) before crashing out.

Any suggestions on how to sort out NCCL issues? Any suggestions on a smaller model to try?

Ubuntu 22.04

vLLM 0.11.0

torch 2.8.0+cu128

NCCL 2.27.3

I tried varying flags: --tensor-parallel-size, --enable-expert-parallel, --pipeline-parallel-size, and others.

Here is the one that got closest (simplified logs):

(ray-vllm) (base) aiteam@oppenheimer:~/ray-vllm$ vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8
INFO 11-04 23:27:15 [init.py:216] Automatically detected platform cuda.
(APIServer pid=31729) INFO 11-04 23:27:18 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=31729) INFO 11-04 23:27:18 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8', 'model': 'RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8', 'quantization': 'fp8', 'tensor_parallel_size': 8}
(APIServer pid=31729) INFO 11-04 23:27:19 [model.py:547] Resolved architecture: Qwen3MoeForCausalLM
(APIServer pid=31729) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=31729) INFO 11-04 23:27:19 [model.py:1510] Using max model len 262144
(APIServer pid=31729) INFO 11-04 23:27:19 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 11-04 23:27:22 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [core.py:77] Initializing a V1 LLM engine…
(EngineCore_DP0 pid=31866) 2025-11-04 23:27:25,814 INFO worker.py:1832 – Connecting to existing Ray cluster at address: 10.1.90.251:6379…
(EngineCore_DP0 pid=31866) 2025-11-04 23:27:25,839 INFO worker.py:2012 – Connected to Ray cluster.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:25 [ray_utils.py:345] No current placement group found. Creating a new placement group.

(EngineCore_DP0 pid=31866) WARNING 11-04 23:27:26 [ray_utils.py:206] tensor_parallel_size=8 is bigger than a reserved number of GPUs (4 GPUs) in a node f651c250524a7ac53150880cef521b96a5a58261b7f79bec3b2a1702. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 8 GPUs available at each node.
(EngineCore_DP0 pid=31866) WARNING 11-04 23:27:26 [ray_utils.py:206] tensor_parallel_size=8 is bigger than a reserved number of GPUs (4 GPUs) in a node 19387efbfabdfed5c239cae361a26268060e57a4827e00db347acf59. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 8 GPUs available at each node.

(EngineCore_DP0 pid=31866) INFO 11-04 23:27:26 [ray_distributed_executor.py:171] use_ray_spmd_worker: True
(EngineCore_DP0 pid=31866) (pid=6142) INFO 11-04 23:27:28 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:32 [ray_env.py:65] Copying the following environment variables to workers: ['VLLM_USE_V1', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) [Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) INFO 11-04 23:27:35 [init.py:1384] Found nccl from library libnccl.so.2
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=6142) INFO 11-04 23:27:35 [pynccl.py:103] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=31866) (pid=15136, ip=10.1.90.180) INFO 11-04 23:27:29 [init.py:216] Automatically detected platform cuda. [repeated 7x across cluster]
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)

(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 505, in _run_workers
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] ray_worker_outputs = ray.get(ray_worker_outputs)

(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=15134, ip=10.1.90.180, …)

(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 291, in NCCL_CHECK
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_DP0 pid=31866) ERROR 11-04 23:27:36 [core.py:708] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

(EngineCore_DP0 pid=31866) Process EngineCore_DP0:
(EngineCore_DP0 pid=31866) Traceback (most recent call last):

(EngineCore_DP0 pid=31866) RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(EngineCore_DP0 pid=31866) INFO 11-04 23:27:36 [ray_distributed_executor.py:122] Shutting down Ray distributed executor…
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] Traceback (most recent call last):

(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15134, ip=10.1.90.180) ERROR 11-04 23:27:36 [worker_base.py:275] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) [Gloo] Rank 6 is connected to 7 peer ranks… [repeated 14x across cluster]
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) INFO 11-04 23:27:35 [init.py:1384] Found nccl from library libnccl.so.2 [repeated 7x across cluster]
(EngineCore_DP0 pid=31866) (RayWorkerWrapper pid=15136, ip=10.1.90.180) INFO 11-04 23:27:35 [pynccl.py:103] vLLM is using nccl==2.27.3 [repeated 7x across cluster]

(APIServer pid=31729) Traceback (most recent call last):
(APIServer pid=31729) File "/home/aiteam/ray-vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=31729) sys.exit(main())

(APIServer pid=31729) File "/home/aiteam/ray-vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=31729) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=31729) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Other attempts:
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --gpu-memory-utilization 0.85
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --max-model-len 32768
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 4 --quantization fp8 --gpu-memory-utilization 0.85
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8 --gpu-memory-utilization 0.85 --max-model-len 32768 --enforce-eager
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8
vllm serve RedHatAI/DeepSeek-Coder-V2-Instruct-FP8 --tensor-parallel-size 4 --enable-expert-parallel
vllm serve RedHatAI/DeepSeek-Coder-V2-Instruct-FP8 --tensor-parallel-size 8 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 4
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 8
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 8 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel --quantization fp
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel --quantization fp8
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7 --enable-expert-parallel
vllm serve RedHatAI/Qwen2.5-Coder-7B-FP8-dynamic --tensor-parallel-size 7
vllm serve RedHatAI/Qwen3-Coder-480B-A35B-Instruct-FP8 --tensor-parallel-size 8 --quantization fp8

vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4 --pipeline_parallel_size 2
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4 --pipeline_parallel_size 2 --enable-expert-parallel
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 8 --enable-expert-parallel

Your NCCL errors when running vLLM across 2 nodes are likely due to network or NCCL misconfiguration. For multi-node tensor parallelism, you must ensure: (1) all nodes have the same environment (model path, Python, CUDA, NCCL, vLLM, etc.), (2) high-speed interconnect (ideally InfiniBand), and (3) correct Ray and NCCL environment variables. The error "NCCL error: unhandled system error" is often caused by network issues, missing NCCL env vars, or lack of GPUDirect RDMA. For L40S (no NVLink), pipeline parallelism is often more reliable than tensor parallelism across nodes, but both require careful setup.
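
As a rough, unverified sketch of point (3), these are the kinds of variables to export on every node before starting Ray and vLLM. Here eno1 is only a placeholder for whichever NIC carries your 10.1.90.x network (check with ip addr), and the IPs are the ones from your logs:

# run on BOTH nodes, in the same shell that will start ray / vllm (sketch; adjust to your setup)
export NCCL_SOCKET_IFNAME=eno1      # pin NCCL to the NIC that connects the two boxes
export GLOO_SOCKET_IFNAME=eno1      # keep the Gloo (CPU) process group on the same NIC
export NCCL_DEBUG=INFO              # log the chosen transport and the exact failing call
export VLLM_HOST_IP=10.1.90.251     # this node's own IP on that NIC (use 10.1.90.180 on the other box)

# head node
ray start --head --port=6379
# worker node
ray start --address=10.1.90.251:6379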

Recommended steps:

  • Use the run_cluster.sh helper to start Ray and vLLM containers, setting VLLM_HOST_IP and ensuring all nodes can communicate.
  • Set NCCL env vars for your network, e.g., NCCL_SOCKET_IFNAME=eth0 (or your interface), and for InfiniBand: NCCL_IB_HCA=mlx5 and --privileged in Docker.
  • Run with NCCL_DEBUG=TRACE to get detailed logs and check for [send] via NET/Socket (bad) vs [send] via NET/IB/GDRDMA (good) in logs.
  • Download the model to the same path on all nodes.
  • Try a smaller model (e.g., Qwen2.5-7B) with --tensor-parallel-size 4 --pipeline-parallel-size 2 to test the multi-node setup (see the example command after this list).
  • If you see "tensor_parallel_size=8 is bigger than a reserved number of GPUs (4 GPUs) in a node", this means vLLM is spreading tensor parallel across nodes, which is only efficient with fast interconnects like InfiniBand. For L40S, pipeline parallelism is recommended (see docs).
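
For example, a minimal cross-node smoke test along the lines of the last two bullets (a sketch, not a verified recipe; Qwen/Qwen2.5-7B-Instruct is just one convenient instruct variant of the Qwen2.5-7B suggested above):

# run from the head node, with the NCCL/Gloo variables above exported on both machines
NCCL_DEBUG=INFO vllm serve Qwen/Qwen2.5-7B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --max-model-len 8192

If this still dies with "unhandled system error", the NCCL_DEBUG output should show which interface and transport each rank picked (the NET/Socket vs NET/IB lines mentioned above), which usually narrows the problem down to interface selection, routing, or a firewall between the nodes.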

Would you like a step-by-step example command and troubleshooting checklist?


Hey, I am running into a similar problem: vLLM does not work with 2 GPUs on the same node. It never uses RayWorker; it calls Gloo but sees more GPUs than expected.

running with:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   29C    P0             62W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:C8:00.0 Off |                    0 |
| N/A   29C    P0             58W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Here's the output for CUDA_VISIBLE_DEVICES=0,1 vllm serve "Qwen/Qwen2.5-VL-32B-Instruct" -tp 2:

INFO 11-12 11:24:52 [init.py:216] Automatically detected platform cuda.
(APIServer pid=1058019) INFO 11-12 11:25:01 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1058019) INFO 11-12 11:25:01 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-VL-32B-Instruct', 'model': 'Qwen/Qwen2.5-VL-32B-Instruct', 'tensor_parallel_size': 2}
(APIServer pid=1058019) WARNING 11-12 11:25:01 [init.py:2877] Found ulimit of 16384 and failed to automatically increase with error current limit exceeds maximum limit. This can cause fd limit errors like OSError: [Errno 24] Too many open files. Consider increasing with ulimit -n
(APIServer pid=1058019) INFO 11-12 11:25:01 [model.py:547] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(APIServer pid=1058019) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=1058019) INFO 11-12 11:25:01 [model.py:1510] Using max model len 128000
(APIServer pid=1058019) INFO 11-12 11:25:02 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 11-12 11:25:09 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=1058272) INFO 11-12 11:25:18 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=1058272) INFO 11-12 11:25:18 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model=‘Qwen/Qwen2.5-VL-32B-Instruct’, speculative_config=None, tokenizer=‘Qwen/Qwen2.5-VL-32B-Instruct’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-VL-32B-Instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={“level”:3,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:
,“splitting_ops”:[“vllm.unified_attention”,“vllm.unified_attention_with_output”,“vllm.mamba_mixer2”,“vllm.mamba_mixer”,“vllm.short_conv”,“vllm.linear_attention”,“vllm.plamo2_mamba_mixer”,“vllm.gdn_attention”,“vllm.sparse_attn_indexer”],“use_inductor”:true,“compile_sizes”:
,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“cudagraph_mode”:[2,1],“use_cudagraph”:true,“cudagraph_num_of_warmups”:1,“cudagraph_capture_sizes”:[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“use_inductor_graph_partition”:false,“pass_config”:{},“max_capture_size”:512,“local_cache_dir”:null}
(EngineCore_DP0 pid=1058272) WARNING 11-12 11:25:18 [multiproc_executor.py:720] Reducing Torch parallelism from 8 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=1058272) INFO 11-12 11:25:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, ‘psm_7b313ac4’), local_subscribe_addr=‘ipc:///tmp/fda33040-f47c-4d58-871a-023472125946’, remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 11:25:24 [init.py:216] Automatically detected platform cuda.
INFO 11-12 11:25:24 [init.py:216] Automatically detected platform cuda.
INFO 11-12 11:25:34 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_cda25056’), local_subscribe_addr=‘ipc:///tmp/cb6cb860-17b9-4270-9469-527a6753052e’, remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 11:25:34 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_16f33670’), local_subscribe_addr=‘ipc:///tmp/551a51e5-02e4-405a-97d0-b2823d9da1c0’, remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 11-12 11:25:36 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 11:25:36 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 11:25:36 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 11:25:36 [pynccl.py:103] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]   File “.../lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 699, in run_engine_core
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]   File “.../lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 498, in init
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]     super().init(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]   File “.../lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 83, in init
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]   File “.../lib/python3.12/site-packages/vllm/executor/executor_base.py”, line 54, in init
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]     self._init_executor()
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]   File “.../lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py”, line 106, in _init_executor
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]   File “/.../lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py”, line 509, in wait_for_ready
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708]     raise e from None
(EngineCore_DP0 pid=1058272) ERROR 11-12 11:25:36 [core.py:708] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=1058272) Process EngineCore_DP0:
(EngineCore_DP0 pid=1058272) Traceback (most recent call last):
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore_DP0 pid=1058272)     self.run()
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/multiprocessing/process.py”, line 108, in run
(EngineCore_DP0 pid=1058272)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 712, in run_engine_core
(EngineCore_DP0 pid=1058272)     raise e
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 699, in run_engine_core
(EngineCore_DP0 pid=1058272)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1058272)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 498, in init
(EngineCore_DP0 pid=1058272)     super().init(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 83, in init
(EngineCore_DP0 pid=1058272)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1058272)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/site-packages/vllm/executor/executor_base.py”, line 54, in init
(EngineCore_DP0 pid=1058272)     self._init_executor()
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py”, line 106, in _init_executor
(EngineCore_DP0 pid=1058272)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=1058272)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1058272)   File “.../lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py”, line 509, in wait_for_ready
(EngineCore_DP0 pid=1058272)     raise e from None
(EngineCore_DP0 pid=1058272) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1058019) Traceback (most recent call last):
(APIServer pid=1058019)   File ".../bin/vllm", line 7, in <module>
(APIServer pid=1058019)     sys.exit(main())
(APIServer pid=1058019)              ^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/entrypoints/cli/main.py”, line 54, in main
(APIServer pid=1058019)     args.dispatch_function(args)
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py”, line 57, in cmd
(APIServer pid=1058019)     uvloop.run(run_server(args))
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/uvloop/init.py”, line 109, in run
(APIServer pid=1058019)     return __asyncio.run(
(APIServer pid=1058019)            ^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/asyncio/runners.py”, line 195, in run
(APIServer pid=1058019)     return runner.run(main)
(APIServer pid=1058019)            ^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/asyncio/runners.py”, line 118, in run
(APIServer pid=1058019)     return self._loop.run_until_complete(task)
(APIServer pid=1058019)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “uvloop/loop.pyx”, line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/uvloop/init.py”, line 61, in wrapper
(APIServer pid=1058019)     return await main
(APIServer pid=1058019)            ^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 1884, in run_server
(APIServer pid=1058019)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 1902, in run_server_worker
(APIServer pid=1058019)     async with build_async_engine_client(
(APIServer pid=1058019)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/contextlib.py”, line 210, in aenter
(APIServer pid=1058019)     return await anext(self.gen)
(APIServer pid=1058019)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 180, in build_async_engine_client
(APIServer pid=1058019)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1058019)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/contextlib.py”, line 210, in aenter
(APIServer pid=1058019)     return await anext(self.gen)
(APIServer pid=1058019)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 225, in build_async_engine_client_from_engine_args
(APIServer pid=1058019)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1058019)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/utils/init.py”, line 1572, in inner
(APIServer pid=1058019)     return fn(*args, **kwargs)
(APIServer pid=1058019)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/v1/engine/async_llm.py”, line 207, in from_vllm_config
(APIServer pid=1058019)     return cls(
(APIServer pid=1058019)            ^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/v1/engine/async_llm.py”, line 134, in init
(APIServer pid=1058019)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1058019)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 102, in make_async_mp_client
(APIServer pid=1058019)     return AsyncMPClient(*client_args)
(APIServer pid=1058019)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 769, in init
(APIServer pid=1058019)     super().init(
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 448, in init
(APIServer pid=1058019)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1058019)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1058019)   File “.../lib/python3.12/contextlib.py”, line 144, in exit
(APIServer pid=1058019)     next(self.gen)
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/v1/engine/utils.py”, line 732, in launch_core_engines
(APIServer pid=1058019)     wait_for_engine_startup(
(APIServer pid=1058019)   File “.../lib/python3.12/site-packages/vllm/v1/engine/utils.py”, line 785, in wait_for_engine_startup
(APIServer pid=1058019)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1058019) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
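
One thing I haven't ruled out yet is the fd-limit warning near the top ("Found ulimit of 16384 and failed to automatically increase"). Before the next attempt I plan to raise it in the same shell, roughly like this (65536 is just an arbitrary higher limit), though I don't know yet whether it is related to the WorkerProc failure:

ulimit -n 65536
CUDA_VISIBLE_DEVICES=0,1 vllm serve "Qwen/Qwen2.5-VL-32B-Instruct" -tp 2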