Failed to run distributed inference with vLLM 0.8.2

Your current environment

For the environment info, please refer to [Bug]: failed to run distribute Inference with vllm 0.8.2 · Issue #17160 · vllm-project/vllm · GitHub

I am trying to run distributed inference on Ray with vLLM 0.8.2 and it fails with the errors below. Here are the steps:

  1. I followed this link to do the sanity check, and it passed (a rough sketch of what test.py does is included after the commands below):
    Troubleshooting — vLLM

Ray head node

======== Autoscaler status: 2025-04-25 13:23:43.124555 ========
Node status
---------------------------------------------------------------
Active:
 1 node_1331c5c771898a75caaaeed528c77d9a10132b3f6f57a46bd444980d
 1 node_f9fee740f9505a910fa99a16799e10a4fad360251146be8297486b49
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 0.0/4.0 GPU
 0B/30.14GiB memory
 0B/12.92GiB object_store_memory

Demands:
 (no resource demands)

NCCL_SOCKET_IFNAME=ens4 GLOO_SOCKET_IFNAME=ens4 NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr 172.21.151.99 test.py

Ray worker node

NCCL_SOCKET_IFNAME=ens4 GLOO_SOCKET_IFNAME=ens4 NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr 172.21.151.99 test.py
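
For reference, test.py here is the sanity-check script from the vLLM Troubleshooting page linked above. Roughly, it runs an all-reduce over both NCCL (GPU) and Gloo (CPU); the sketch below is a simplification of what it checks, not the exact script:

    import torch
    import torch.distributed as dist

    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for each process
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # GPU all-reduce over NCCL: every rank should end up with world_size
    data = torch.ones(128, device="cuda")
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    assert data.mean().item() == dist.get_world_size()

    # CPU all-reduce over Gloo (this is what GLOO_SOCKET_IFNAME is used for)
    gloo_group = dist.new_group(backend="gloo")
    cpu_data = torch.ones(128)
    dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
    assert cpu_data.mean().item() == dist.get_world_size()

    print("sanity check passed")
    dist.destroy_process_group()
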
  2. I ran the following script to start distributed inference on the Ray worker node:

my script:

export NCCL_SOCKET_IFNAME=ens4
export GLOO_SOCKET_IFNAME=ens4
export TP_SOCKET_IFNAME=ens4
#export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
#export NCCL_P2P_LEVEL=SYS
export VLLM_HOST_IP=172.21.222.20
#export NCCL_DEBUG_SUBSYS=ALL
export NCCL_SHM_DISABLE=1

export NCCL_NET=Socket
export NCCL_DEBUG=TRACE

vllm serve /models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
	--served-model-name DeepSeek-R1-Distill-Qwen-14B \
	--gpu-memory-utilization 0.95 \
	--dtype half \
	--tensor-parallel-size 2 \
	--enforce-eager \
	--trust-remote-code \
	--max-model-len 2048 \
	--enable-chunked-prefill \
	--pipeline-parallel-size 2
  3. Got the following errors:
(ray) root@ds-x86:~/ray# ./ray-vllm-14B.sh
INFO 04-25 05:15:11 [__init__.py:239] Automatically detected platform cuda.
INFO 04-25 05:15:13 [api_server.py:981] vLLM API server version 0.8.2
INFO 04-25 05:15:13 [api_server.py:982] args: Namespace(subparser='serve', model_tag='/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='half', kv_cache_dtype='auto', max_model_len=2048, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=2, tensor_parallel_size=2, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_config=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['DeepSeek-R1-Distill-Qwen-14B'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, 
override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7dc10ca6fba0>)
WARNING 04-25 05:15:13 [config.py:2614] Casting torch.bfloat16 to torch.float16.
INFO 04-25 05:15:20 [config.py:585] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 04-25 05:15:20 [arg_utils.py:1854] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
INFO 04-25 05:15:20 [config.py:1519] Defaulting to use ray for distributed inference
INFO 04-25 05:15:20 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-25 05:15:20 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B', speculative_config=None, tokenizer='/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=DeepSeek-R1-Distill-Qwen-14B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
2025-04-25 05:15:21,614	INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 172.21.151.99:6379...
2025-04-25 05:15:21,627	INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at http://172.21.151.99:8265
INFO 04-25 05:15:21 [ray_utils.py:335] No current placement group found. Creating a new placement group.
INFO 04-25 05:15:22 [ray_distributed_executor.py:176] use_ray_spmd_worker: False
(pid=3614692) INFO 04-25 05:15:26 [__init__.py:239] Automatically detected platform cuda.
INFO 04-25 05:15:28 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()
INFO 04-25 05:15:28 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1']
INFO 04-25 05:15:28 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(RayWorkerWrapper pid=3614806) INFO 04-25 05:15:29 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=3614806) INFO 04-25 05:15:29 [cuda.py:288] Using XFormers backend.
INFO 04-25 05:15:29 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-25 05:15:29 [cuda.py:288] Using XFormers backend.
INFO 04-25 05:15:30 [utils.py:931] Found nccl from library libnccl.so.2
INFO 04-25 05:15:30 [pynccl.py:69] vLLM is using nccl==2.21.5
ds-x86:3614588:3614588 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens4
ds-x86:3614588:3614588 [0] NCCL INFO Bootstrap : Using ens4:172.21.222.20<0>
ds-x86:3614588:3614588 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ds-x86:3614588:3614588 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ds-x86:3614588:3614588 [0] NCCL INFO NET/Plugin: Using internal network plugin.
ds-x86:3614588:3614588 [0] NCCL INFO cudaDriverVersion 12080
NCCL version 2.21.5+cuda12.4
(RayWorkerWrapper pid=3614806) INFO 04-25 05:15:30 [utils.py:931] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=3614806) INFO 04-25 05:15:30 [pynccl.py:69] vLLM is using nccl==2.21.5
ds-x86:3614588:3614588 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
ds-x86:3614588:3614588 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens4
ds-x86:3614588:3614588 [0] NCCL INFO NET/Socket : Using [0]ens4:172.21.222.20<0>
ds-x86:3614588:3614588 [0] NCCL INFO Using non-device net plugin version 0
ds-x86:3614588:3614588 [0] NCCL INFO Using network Socket
ds-x86:3614588:3614588 [0] NCCL INFO ncclCommInitRank comm 0x3f5280a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId a0 commId 0x9cd5c8fef97ef147 - Init START
ds-x86:3614588:3614588 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
ds-x86:3614588:3614588 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
ds-x86:3614588:3614588 [0] NCCL INFO comm 0x3f5280a0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
ds-x86:3614588:3614588 [0] NCCL INFO Channel 00/02 :    0   0

ds-x86:3614588:3614588 [0] graph/rings.cc:51 NCCL WARN Error : ring 0 does not contain rank 1
ds-x86:3614588:3614588 [0] NCCL INFO graph/connect.cc:489 -> 3
ds-x86:3614588:3614588 [0] NCCL INFO init.cc:1210 -> 3
ds-x86:3614588:3614588 [0] NCCL INFO init.cc:1548 -> 3
ds-x86:3614588:3614588 [0] NCCL INFO init.cc:1799 -> 3
ds-x86:3614588:3614588 [0] NCCL INFO init.cc:1837 -> 3
ERROR 04-25 05:15:31 [worker_base.py:620] Error executing method 'init_device'. This might cause deadlock in distributed execution.
ERROR 04-25 05:15:31 [worker_base.py:620] Traceback (most recent call last):
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 612, in execute_method
ERROR 04-25 05:15:31 [worker_base.py:620]     return run_method(self, method, args, kwargs)
ERROR 04-25 05:15:31 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/utils.py", line 2255, in run_method
ERROR 04-25 05:15:31 [worker_base.py:620]     return func(*args, **kwargs)
ERROR 04-25 05:15:31 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 604, in init_device
ERROR 04-25 05:15:31 [worker_base.py:620]     self.worker.init_device()  # type: ignore
ERROR 04-25 05:15:31 [worker_base.py:620]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker.py", line 166, in init_device
ERROR 04-25 05:15:31 [worker_base.py:620]     init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker.py", line 507, in init_worker_distributed_environment
ERROR 04-25 05:15:31 [worker_base.py:620]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 992, in ensure_model_parallel_initialized
ERROR 04-25 05:15:31 [worker_base.py:620]     initialize_model_parallel(tensor_model_parallel_size,
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 925, in initialize_model_parallel
ERROR 04-25 05:15:31 [worker_base.py:620]     _TP = init_model_parallel_group(group_ranks,
ERROR 04-25 05:15:31 [worker_base.py:620]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 730, in init_model_parallel_group
ERROR 04-25 05:15:31 [worker_base.py:620]     return GroupCoordinator(
ERROR 04-25 05:15:31 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 207, in __init__
ERROR 04-25 05:15:31 [worker_base.py:620]     self.device_communicator = device_comm_cls(
ERROR 04-25 05:15:31 [worker_base.py:620]                                ^^^^^^^^^^^^^^^^
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
ERROR 04-25 05:15:31 [worker_base.py:620]     self.pynccl_comm = PyNcclCommunicator(
ERROR 04-25 05:15:31 [worker_base.py:620]                        ^^^^^^^^^^^^^^^^^^^
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
ERROR 04-25 05:15:31 [worker_base.py:620]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
ERROR 04-25 05:15:31 [worker_base.py:620]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
ERROR 04-25 05:15:31 [worker_base.py:620]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
ERROR 04-25 05:15:31 [worker_base.py:620]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 04-25 05:15:31 [worker_base.py:620] RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/ray/bin/vllm", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:              ^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
[rank0]:     args.dispatch_function(args)
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 33, in cmd
[rank0]:     uvloop.run(run_server(args))
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
[rank0]:     return __asyncio.run(
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]:     return runner.run(main)
[rank0]:            ^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]:     return self._loop.run_until_complete(task)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]:     return await main
[rank0]:            ^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1016, in run_server
[rank0]:     async with build_async_engine_client(args) as engine_client:
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 141, in build_async_engine_client
[rank0]:     async with build_async_engine_client_from_engine_args(
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 189, in build_async_engine_client_from_engine_args
[rank0]:     engine_client = AsyncLLMEngine.from_vllm_config(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 650, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 605, in __init__
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 280, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config, )
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 271, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 114, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 395, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 516, in _run_workers
[rank0]:     self.driver_worker.execute_method(sent_method, *args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 621, in execute_method
[rank0]:     raise e
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 612, in execute_method
[rank0]:     return run_method(self, method, args, kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/utils.py", line 2255, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 604, in init_device
[rank0]:     self.worker.init_device()  # type: ignore
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker.py", line 166, in init_device
[rank0]:     init_worker_distributed_environment(self.vllm_config, self.rank,
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker.py", line 507, in init_worker_distributed_environment
[rank0]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 992, in ensure_model_parallel_initialized
[rank0]:     initialize_model_parallel(tensor_model_parallel_size,
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 925, in initialize_model_parallel
[rank0]:     _TP = init_model_parallel_group(group_ranks,
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 730, in init_model_parallel_group
[rank0]:     return GroupCoordinator(
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 207, in __init__
[rank0]:     self.device_communicator = device_comm_cls(
[rank0]:                                ^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
[rank0]:     self.pynccl_comm = PyNcclCommunicator(
[rank0]:                        ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620] Traceback (most recent call last):
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 612, in execute_method
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/utils.py", line 2255, in run_method
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     self.worker.init_device()  # type: ignore
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker.py", line 166, in init_device
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     init_worker_distributed_environment(self.vllm_config, self.rank,
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/worker/worker.py", line 507, in init_worker_distributed_environment
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 992, in ensure_model_parallel_initialized
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     initialize_model_parallel(tensor_model_parallel_size,
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 925, in initialize_model_parallel
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     _TP = init_model_parallel_group(group_ranks,
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 730, in init_model_parallel_group
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     return GroupCoordinator(
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 207, in __init__
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     self.device_communicator = device_comm_cls(
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]                                ^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     self.pynccl_comm = PyNcclCommunicator(
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]                        ^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]   File "/root/miniconda3/envs/ray/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620]     raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=3614806) ERROR 04-25 05:15:31 [worker_base.py:620] RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers
(pid=443095, ip=172.21.151.99) INFO 04-25 13:15:27 [__init__.py:239] Automatically detected platform cuda. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=443095, ip=172.21.151.99) INFO 04-25 13:15:29 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. [repeated 2x across cluster]
(RayWorkerWrapper pid=443095, ip=172.21.151.99) INFO 04-25 13:15:29 [cuda.py:288] Using XFormers backend. [repeated 2x across cluster]
(RayWorkerWrapper pid=443095, ip=172.21.151.99) INFO 04-25 13:15:30 [utils.py:931] Found nccl from library libnccl.so.2 [repeated 2x across cluster]
(RayWorkerWrapper pid=443095, ip=172.21.151.99) INFO 04-25 13:15:30 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 2x across cluster]
INFO 04-25 05:15:32 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
[rank0]:[W425 05:15:32.229636086 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(ray) root@ds-x86:~/ray#

The error you’re encountering is related to NCCL (NVIDIA Collective Communications Library) initialization, which is crucial for distributed training and inference. The error message “NCCL error: internal error” suggests a problem with the NCCL setup. According to the vLLM GitHub issue, this can be due to incorrect network configurations or missing NCCL plugins.

To resolve this, ensure that the network interface specified by NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME is correct and reachable from all nodes. Additionally, verify that the NCCL version is compatible with your setup. You might also want to check whether the libnccl-net.so plugin is available and correctly loaded; if it is absent, NCCL falls back to its internal network plugin, which can be less efficient. For more detailed debugging, set NCCL_DEBUG=INFO to get additional logs that might help identify the root cause.
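
For example, a quick way to confirm that the interface and the node IPs line up (these commands assume the ens4 interface and the addresses from your logs; adjust as needed):

    # on each node, confirm that ens4 carries the IP you pass to Ray/vLLM
    ip -4 addr show ens4
    # from the worker node, check that the head node is reachable over that interface (and vice versa)
    ping -c 3 -I ens4 172.21.151.99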

Would you like more detailed troubleshooting steps or information on configuring NCCL for distributed setups?


Hi,

As I said in the first comment, I have already verified that NCCL works properly by following this link: Troubleshooting — vLLM

Could you please suggest any specific commands to troubleshoot?

To troubleshoot NCCL issues further, you can use the following steps and commands:

  1. Verify Network Configuration: Ensure that the network interface specified by NCCL_SOCKET_IFNAME is correct and accessible from all nodes. You can check the available network interfaces using:

    ip addr show
    
  2. Check NCCL Version: Ensure that the NCCL version is consistent across all nodes and compatible with your CUDA version. You can verify the NCCL version using the command below (if NCCL_VERSION is not set in your environment, see the note after this list):

    echo $NCCL_VERSION
    
  3. Run NCCL Tests: Use the NCCL tests to verify communication between nodes. You can clone the NCCL tests repository and run the tests:

    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    make MPI=1 MPI_HOME=/path/to/mpi   # build with MPI support so the test can be launched across nodes with mpirun
    mpirun -np 2 -H <host1>,<host2> ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
    # with 2 GPUs per node, you can also exercise all four ranks:
    # mpirun -np 4 -H <host1>:2,<host2>:2 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
    
  4. Enable Detailed Logging: Set NCCL_DEBUG=INFO to get more detailed logs, which can help identify the issue:

    export NCCL_DEBUG=INFO
    
  5. Check for NCCL Plugin: Check whether the libnccl-net.so plugin is available and correctly loaded, if your environment is supposed to provide one. Its absence is not fatal (NCCL falls back to its internal network plugin, as your log shows), but a broken or mismatched plugin can cause initialization failures; a quick way to see what is installed is shown in the note after this list.

  6. Test with a Simple Script: Create a simple PyTorch script to test NCCL communication. Launch it with torchrun on both nodes (the same way as your sanity check) so that RANK, WORLD_SIZE, and LOCAL_RANK are set:

    import os
    import torch
    import torch.distributed as dist

    # torchrun provides RANK / WORLD_SIZE / LOCAL_RANK environment variables
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])  # GPU index on this node
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    tensor = torch.ones(1, device=f'cuda:{local_rank}')
    dist.all_reduce(tensor)  # every rank should end up with world_size
    print(f'Rank {rank}/{world_size} has data {tensor[0]}')
    dist.destroy_process_group()
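
As a complement to steps 2 and 5: if NCCL_VERSION is not set in your environment, you can ask PyTorch which NCCL it bundles (usually the libnccl.so.2 that vLLM reports, 2.21.5 in your log), and you can list the NCCL libraries and net plugins visible to the dynamic linker. A minimal check, assuming a standard pip-installed PyTorch:

    python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"
    ldconfig -p | grep -i nccl   # shows libnccl.so.2 and any libnccl-net.so plugin, if present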
    

Would you like more detailed guidance on any of these steps?


Could anybody help take a look at this?

To assist you effectively, could you please provide more specific details about the issue you’re facing? For instance, are there any specific error messages or logs that you haven’t shared yet? Additionally, it would be helpful to know if you’ve tried any other troubleshooting steps beyond verifying NCCL functionality. This information will help in diagnosing the problem more accurately.


I need a human assistant. The information you provided above is meaningless.