vLLM startup hangs: the log stops at the NCCL stage and does not proceed

XiaoDouGeGe · August 20, 2025, 9:21am

root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20# export VLLM_LOGGING_LEVEL=DEBUG
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20#
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20# export NCCL_DEBUG=INFO
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20#
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20# python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model /data-new/models/Qwen2.5-72B --served-model-name /data-new/models/Qwen2.5-72B --trust-remote-code --gpu-memory-utilization 0.9 -tp 4
DEBUG 08-20 17:17:44 [__init__.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:17:44 [__init__.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:17:44 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
DEBUG 08-20 17:17:44 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:17:44 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:17:44 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:17:44 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 08-20 17:17:44 [__init__.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:17:44 [__init__.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:17:44 [__init__.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:17:44 [__init__.py:148] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 08-20 17:17:44 [__init__.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:17:44 [__init__.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:17:44 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:17:44 [__init__.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:17:44 [__init__.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:17:45 [utils.py:150] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 08-20 17:17:45 [__init__.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:17:45 [__init__.py:41] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:17:45 [__init__.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-20 17:17:46 [api_server.py:1287] vLLM API server version 0.9.1
INFO 08-20 17:17:46 [cli_args.py:309] non-default args: {'host': '0.0.0.0', 'model': '/data-new/models/Qwen2.5-72B', 'trust_remote_code': True, 'served_model_name': ['/data-new/models/Qwen2.5-72B'], 'tensor_parallel_size': 4}
INFO 08-20 17:17:53 [config.py:823] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
DEBUG 08-20 17:17:53 [arg_utils.py:1600] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
DEBUG 08-20 17:17:53 [arg_utils.py:1607] Setting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
INFO 08-20 17:17:53 [config.py:1946] Defaulting to use mp for distributed inference
INFO 08-20 17:17:53 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 08-20 17:17:56 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
DEBUG 08-20 17:17:57 [__init__.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:17:57 [__init__.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:17:57 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
DEBUG 08-20 17:17:57 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:17:57 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:17:57 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:17:57 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 08-20 17:17:57 [__init__.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:17:57 [__init__.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:17:57 [__init__.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:17:57 [__init__.py:148] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 08-20 17:17:57 [__init__.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:17:57 [__init__.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:17:57 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:17:57 [__init__.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:17:57 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 17:18:00 [core.py:455] Waiting for init message from front-end.
DEBUG 08-20 17:18:00 [utils.py:547] HELLO from local core engine process 0.
DEBUG 08-20 17:18:00 [core.py:463] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/5d49e0ce-dfe9-49b2-a5f4-fe250cc30826'], outputs=['ipc:///tmp/66beb4ef-4061-4daa-b693-7e3fb030334e'], coordinator_input=None, coordinator_output=None), parallel_config={'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, 'data_parallel_size': 1})
DEBUG 08-20 17:18:00 [__init__.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:18:00 [__init__.py:41] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:18:00 [__init__.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-20 17:18:00 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/data-new/models/Qwen2.5-72B', speculative_config=None, tokenizer='/data-new/models/Qwen2.5-72B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data-new/models/Qwen2.5-72B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 08-20 17:18:00 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 76 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
DEBUG 08-20 17:18:00 [shm_broadcast.py:243] Binding to ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410
INFO 08-20 17:18:00 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_8fd26fcc'), local_subscribe_addr='ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 17:18:01 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 17:18:01 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 17:18:01 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 17:18:01 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
DEBUG 08-20 17:18:03 [__init__.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:18:03 [__init__.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:18:03 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
DEBUG 08-20 17:18:03 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:18:03 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:18:03 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:18:03 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 08-20 17:18:03 [__init__.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:18:03 [__init__.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:18:03 [__init__.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:18:03 [__init__.py:148] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 08-20 17:18:03 [__init__.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:18:03 [__init__.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:18:03 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:18:03 [__init__.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:18:03 [__init__.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:18:03 [__init__.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:18:03 [__init__.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:18:03 [__init__.py:45] TPU platform is not available because: No module named 'libtpu'
DEBUG 08-20 17:18:03 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:18:03 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:18:03 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:18:03 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
…
WARNING 08-20 17:18:06 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f23686a0050>
DEBUG 08-20 17:18:06 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:18:06 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=0 pid=914) DEBUG 08-20 17:18:06 [shm_broadcast.py:313] Connecting to ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410
(VllmWorker rank=1 pid=915) DEBUG 08-20 17:18:06 [parallel_state.py:918] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:56771 backend=nccl
(VllmWorker rank=0 pid=914) DEBUG 08-20 17:18:06 [shm_broadcast.py:243] Binding to ipc:///tmp/e3ed3031-b88c-4940-9e31-f54185abae0f
(VllmWorker rank=0 pid=914) INFO 08-20 17:18:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1dc59dc6'), local_subscribe_addr='ipc:///tmp/e3ed3031-b88c-4940-9e31-f54185abae0f', remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 08-20 17:18:06 [decorators.py:110] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 08-20 17:18:06 [decorators.py:110] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama_eagle3.LlamaModel'>: ['input_ids', 'positions', 'hidden_states']
WARNING 08-20 17:18:06 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f421b5e73b0>
DEBUG 08-20 17:18:06 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:18:06 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=2 pid=916) DEBUG 08-20 17:18:06 [shm_broadcast.py:313] Connecting to ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410
(VllmWorker rank=2 pid=916) DEBUG 08-20 17:18:06 [shm_broadcast.py:243] Binding to ipc:///tmp/e26551fd-33d3-48d0-b78a-a58e8757c0c8
(VllmWorker rank=2 pid=916) INFO 08-20 17:18:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_bf4a9042'), local_subscribe_addr='ipc:///tmp/e26551fd-33d3-48d0-b78a-a58e8757c0c8', remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 08-20 17:18:06 [__init__.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:18:06 [__init__.py:41] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:18:06 [__init__.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
DEBUG 08-20 17:18:06 [decorators.py:110] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 08-20 17:18:06 [decorators.py:110] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama_eagle3.LlamaModel'>: ['input_ids', 'positions', 'hidden_states']
(VllmWorker rank=0 pid=914) DEBUG 08-20 17:18:06 [parallel_state.py:918] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:56771 backend=nccl
WARNING 08-20 17:18:06 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9d17a2d760>
DEBUG 08-20 17:18:06 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:18:06 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=3 pid=917) DEBUG 08-20 17:18:06 [shm_broadcast.py:313] Connecting to ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410
(VllmWorker rank=3 pid=917) DEBUG 08-20 17:18:06 [shm_broadcast.py:243] Binding to ipc:///tmp/aa38f861-f5b5-490c-98d2-8494edb22415
(VllmWorker rank=3 pid=917) INFO 08-20 17:18:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_51d13c38'), local_subscribe_addr='ipc:///tmp/aa38f861-f5b5-490c-98d2-8494edb22415', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=916) DEBUG 08-20 17:18:06 [parallel_state.py:918] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:56771 backend=nccl
(VllmWorker rank=3 pid=917) DEBUG 08-20 17:18:07 [parallel_state.py:918] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:56771 backend=nccl
(VllmWorker rank=2 pid=916) INFO 08-20 17:18:07 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=916) INFO 08-20 17:18:07 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=917) INFO 08-20 17:18:07 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=917) INFO 08-20 17:18:07 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=915) INFO 08-20 17:18:07 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=915) INFO 08-20 17:18:07 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=914) INFO 08-20 17:18:07 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=914) INFO 08-20 17:18:07 [pynccl.py:70] vLLM is using nccl==2.26.2
a49e15233991:914:914 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
a49e15233991:914:914 [0] NCCL INFO cudaDriverVersion 12080
a49e15233991:914:914 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
a49e15233991:916:916 [2] NCCL INFO cudaDriverVersion 12080
a49e15233991:916:916 [2] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
a49e15233991:916:916 [2] NCCL INFO NCCL version 2.26.2+cuda12.2
a49e15233991:917:917 [3] NCCL INFO cudaDriverVersion 12080
a49e15233991:917:917 [3] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
a49e15233991:917:917 [3] NCCL INFO NCCL version 2.26.2+cuda12.2
a49e15233991:915:915 [1] NCCL INFO cudaDriverVersion 12080
a49e15233991:915:915 [1] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
a49e15233991:915:915 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
a49e15233991:914:914 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
a49e15233991:914:914 [0] NCCL INFO NET/IB : No device found.
a49e15233991:914:914 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
a49e15233991:914:914 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
a49e15233991:914:914 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
a49e15233991:914:914 [0] NCCL INFO Using network Socket
a49e15233991:914:914 [0] NCCL INFO ncclCommInitRank comm 0xd878d20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4f000 commId 0x5d717172769c2522 - Init START
a49e15233991:917:917 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
a49e15233991:916:916 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
a49e15233991:917:917 [3] NCCL INFO NET/IB : No device found.
a49e15233991:917:917 [3] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
a49e15233991:917:917 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
a49e15233991:916:916 [2] NCCL INFO NET/IB : No device found.
a49e15233991:916:916 [2] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
a49e15233991:917:917 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
a49e15233991:917:917 [3] NCCL INFO Using network Socket
a49e15233991:916:916 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
a49e15233991:916:916 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
a49e15233991:916:916 [2] NCCL INFO Using network Socket
a49e15233991:917:917 [3] NCCL INFO ncclCommInitRank comm 0xe44f7d0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 57000 commId 0x5d717172769c2522 - Init START
a49e15233991:916:916 [2] NCCL INFO ncclCommInitRank comm 0xd66b740 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 56000 commId 0x5d717172769c2522 - Init START
a49e15233991:917:917 [3] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
a49e15233991:915:915 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
a49e15233991:915:915 [1] NCCL INFO NET/IB : No device found.
a49e15233991:915:915 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
a49e15233991:915:915 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
a49e15233991:915:915 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
a49e15233991:915:915 [1] NCCL INFO Using network Socket
a49e15233991:915:915 [1] NCCL INFO ncclCommInitRank comm 0xe33aaf0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 52000 commId 0x5d717172769c2522 - Init START
a49e15233991:914:914 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
a49e15233991:915:915 [1] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
a49e15233991:916:916 [2] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
a49e15233991:916:916 [2] NCCL INFO Bootstrap timings total 0.048889 (create 0.000016, send 0.000055, recv 0.000041, ring 0.000035, delay 0.000000)
a49e15233991:915:915 [1] NCCL INFO Bootstrap timings total 0.002144 (create 0.000020, send 0.000069, recv 0.000270, ring 0.001528, delay 0.000000)
a49e15233991:917:917 [3] NCCL INFO Bootstrap timings total 0.049082 (create 0.000017, send 0.000053, recv 0.000067, ring 0.048548, delay 0.000000)
a49e15233991:916:916 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a49e15233991:914:914 [0] NCCL INFO Bootstrap timings total 0.065509 (create 0.000022, send 0.000062, recv 0.063531, ring 0.001624, delay 0.000000)
a49e15233991:915:915 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a49e15233991:917:917 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a49e15233991:914:914 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a49e15233991:914:914 [0] NCCL INFO Setting affinity for GPU 0 to 03ffff,fffff000,0000003f,ffffffff
a49e15233991:914:914 [0] NCCL INFO NVLS multicast support is not available on dev 0
a49e15233991:917:917 [3] NCCL INFO Setting affinity for GPU 3 to 03ffff,fffff000,0000003f,ffffffff
a49e15233991:917:917 [3] NCCL INFO NVLS multicast support is not available on dev 3
a49e15233991:916:916 [2] NCCL INFO Setting affinity for GPU 2 to 03ffff,fffff000,0000003f,ffffffff
a49e15233991:916:916 [2] NCCL INFO NVLS multicast support is not available on dev 2
a49e15233991:915:915 [1] NCCL INFO Setting affinity for GPU 1 to 03ffff,fffff000,0000003f,ffffffff
a49e15233991:915:915 [1] NCCL INFO NVLS multicast support is not available on dev 1
a49e15233991:917:917 [3] NCCL INFO comm 0xe44f7d0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
a49e15233991:915:915 [1] NCCL INFO comm 0xe33aaf0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
a49e15233991:914:914 [0] NCCL INFO comm 0xd878d20 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
a49e15233991:916:916 [2] NCCL INFO comm 0xd66b740 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
a49e15233991:917:917 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2
a49e15233991:915:915 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
a49e15233991:914:914 [0] NCCL INFO Channel 00/04 : 0 1 2 3
a49e15233991:916:916 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
a49e15233991:917:917 [3] NCCL INFO P2P Chunksize set to 131072
a49e15233991:915:915 [1] NCCL INFO P2P Chunksize set to 131072
a49e15233991:914:914 [0] NCCL INFO Channel 01/04 : 0 1 2 3
a49e15233991:916:916 [2] NCCL INFO P2P Chunksize set to 131072
a49e15233991:914:914 [0] NCCL INFO Channel 02/04 : 0 1 2 3
a49e15233991:914:914 [0] NCCL INFO Channel 03/04 : 0 1 2 3
a49e15233991:914:914 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
a49e15233991:914:914 [0] NCCL INFO P2P Chunksize set to 131072
a49e15233991:914:914 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0
a49e15233991:916:1018 [2] NCCL INFO [Proxy Service] Device 2 CPU core 2
a49e15233991:915:1022 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 113
a49e15233991:916:1021 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 18
a49e15233991:914:1019 [0] NCCL INFO [Proxy Service] Device 0 CPU core 30
a49e15233991:915:1017 [1] NCCL INFO [Proxy Service] Device 1 CPU core 111
a49e15233991:914:1023 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 19
a49e15233991:917:1016 [3] NCCL INFO [Proxy Service] Device 3 CPU core 10
a49e15233991:917:1020 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 91
a49e15233991:917:917 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC
a49e15233991:914:914 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC
a49e15233991:914:914 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC
a49e15233991:914:914 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC
a49e15233991:914:914 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
a49e15233991:916:916 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
a49e15233991:914:914 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
a49e15233991:915:915 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
a49e15233991:917:917 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Connected all trees
a49e15233991:914:914 [0] NCCL INFO Connected all trees
a49e15233991:916:916 [2] NCCL INFO Connected all trees
a49e15233991:915:915 [1] NCCL INFO Connected all trees
a49e15233991:916:1024 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 83
a49e15233991:916:916 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
a49e15233991:916:916 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
a49e15233991:915:1025 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 87
a49e15233991:915:915 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
a49e15233991:915:915 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
a49e15233991:917:1026 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 96
a49e15233991:917:917 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
a49e15233991:917:917 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
a49e15233991:914:1027 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 30
a49e15233991:914:914 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
a49e15233991:914:914 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
a49e15233991:914:914 [0] NCCL INFO CC Off, workFifoBytes 1048576
a49e15233991:917:917 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
a49e15233991:915:915 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
a49e15233991:917:917 [3] NCCL INFO ncclCommInitRank comm 0xe44f7d0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 57000 commId 0x5d717172769c2522 - Init COMPLETE
a49e15233991:915:915 [1] NCCL INFO ncclCommInitRank comm 0xe33aaf0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 52000 commId 0x5d717172769c2522 - Init COMPLETE
a49e15233991:917:917 [3] NCCL INFO Init timings - ncclCommInitRank: rank 3 nranks 4 total 0.69 (kernels 0.55, alloc 0.00, bootstrap 0.05, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.08, rest 0.00)
a49e15233991:915:915 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 4 total 0.69 (kernels 0.60, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.07, rest 0.00)
a49e15233991:914:914 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
a49e15233991:914:914 [0] NCCL INFO ncclCommInitRank comm 0xd878d20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4f000 commId 0x5d717172769c2522 - Init COMPLETE
a49e15233991:914:914 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 4 total 0.72 (kernels 0.55, alloc 0.00, bootstrap 0.07, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.08, rest 0.02)
a49e15233991:916:916 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
a49e15233991:916:916 [2] NCCL INFO ncclCommInitRank comm 0xd66b740 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 56000 commId 0x5d717172769c2522 - Init COMPLETE
a49e15233991:916:916 [2] NCCL INFO Init timings - ncclCommInitRank: rank 2 nranks 4 total 0.71 (kernels 0.56, alloc 0.00, bootstrap 0.05, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.07, rest 0.02)
DEBUG 08-20 17:18:10 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:18:20 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:18:30 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:18:40 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:18:50 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
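
In the output above, all four ranks print an `ncclCommInitRank ... - Init COMPLETE` line, which suggests communicator setup itself finished and the stall comes afterwards. For longer logs, a small script can do that check. This is only a sketch: the function names `nccl_init_status` and `missing_ranks` are made up for illustration, and the regex assumes the `NCCL_DEBUG=INFO` line format shown in this post.

```python
import re

# Matches NCCL_DEBUG=INFO lines of the form seen in this post, e.g.
#   "... NCCL INFO ncclCommInitRank comm 0xd878d20 rank 0 nranks 4 ... - Init COMPLETE"
# The "Init timings - ncclCommInitRank: ..." lines do not match, by design.
INIT_RE = re.compile(
    r"ncclCommInitRank comm \S+ rank (\d+) nranks (\d+) .* - Init (START|COMPLETE)"
)


def nccl_init_status(log_text):
    """Return (nranks, started_ranks, completed_ranks) parsed from the log text."""
    nranks = 0
    started, completed = set(), set()
    for line in log_text.splitlines():
        match = INIT_RE.search(line)
        if match is None:
            continue
        rank = int(match.group(1))
        nranks = int(match.group(2))
        if match.group(3) == "START":
            started.add(rank)
        else:
            completed.add(rank)
    return nranks, started, completed


def missing_ranks(log_text):
    """Ranks that never logged 'Init COMPLETE' (empty list if all of them did)."""
    nranks, _, completed = nccl_init_status(log_text)
    return sorted(set(range(nranks)) - completed)
```

Calling `missing_ranks(open("vllm.log").read())` on a saved copy of the output (the file name `vllm.log` is hypothetical) returns the ranks still stuck in NCCL init. An empty list means every rank got through `ncclCommInitRank`, so the hang is somewhere later in startup.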
