Starting a vLLM model service on four L20 GPUs
nvidia-smi
Wed Aug 20 16:45:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L20                     Off |   00000000:4F:00.0 Off |                    0 |
| N/A   30C    P8             38W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L20                     Off |   00000000:52:00.0 Off |                    0 |
| N/A   30C    P8             37W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L20                     Off |   00000000:56:00.0 Off |                    0 |
| N/A   29C    P8             38W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L20                     Off |   00000000:57:00.0 Off |                    0 |
| N/A   29C    P8             37W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off |   00000000:D1:00.0 Off |                  Off |
| 30%   27C    P8             16W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off |   00000000:D5:00.0 Off |                  Off |
| 30%   28C    P8              6W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off |   00000000:D6:00.0 Off |                  Off |
| 30%   29C    P8              8W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
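The box mixes four L20s (GPUs 0-3) with three RTX 4090s (GPUs 4-6). If one wants to be explicit about which four cards vLLM uses, pinning the L20s before launch is an option (an assumption on my part, not part of the launch command below):

# Hypothetical: restrict vLLM to the four L20s (GPU indices 0-3) so the 4090s are never picked up
export CUDA_VISIBLE_DEVICES=0,1,2,3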
The startup command is:
export VLLM_ATTENTION_BACKEND=FLASHINFER
nohup python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 \
  --port 8000 \
  --model /data-new/models/Qwen2.5-72B \
  --served-model-name /data-new/models/Qwen2.5-72B --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  -tp 4 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 60000 \
  --tool-call-parser hermes --enable-auto-tool-choice > vllm-$(date +%Y%m%d%H%M).log 2>&1 &
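For reference, once loading completes the server should expose an OpenAI-compatible API on port 8000, so a minimal sanity check against the served model name above would look like this (hypothetical requests; the server never got this far here):

# List registered models (should return /data-new/models/Qwen2.5-72B)
curl http://localhost:8000/v1/models

# Minimal chat completion against the served model name
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/data-new/models/Qwen2.5-72B", "messages": [{"role": "user", "content": "hello"}]}'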
Startup log (it hangs at the end):
INFO 08-20 16:36:24 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:28 [api_server.py:1287] vLLM API server version 0.9.1
INFO 08-20 16:36:28 [cli_args.py:309] non-default args: {'host': '0.0.0.0', 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/data-new/models/Qwen2.5-72B', 'trust_remote_code': True, 'rope_scaling': {'rope_type': 'yarn', 'factor': 4.0, 'original_max_position_embeddings': 32768}, 'max_model_len': 60000, 'served_model_name': ['/data-new/models/Qwen2.5-72B'], 'tensor_parallel_size': 4}
INFO 08-20 16:36:28 [config.py:533] Overriding HF config with {'rope_scaling': {'rope_type': 'yarn', 'factor': 4.0, 'original_max_position_embeddings': 32768}}
INFO 08-20 16:36:37 [config.py:823] This model supports multiple tasks: {'embed', 'score', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 08-20 16:36:37 [config.py:1946] Defaulting to use mp for distributed inference
INFO 08-20 16:36:37 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 08-20 16:36:40 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 08-20 16:36:41 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:44 [core.py:455] Waiting for init message from front-end.
INFO 08-20 16:36:44 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/data-new/models/Qwen2.5-72B', speculative_config=None, tokenizer='/data-new/models/Qwen2.5-72B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=60000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data-new/models/Qwen2.5-72B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 08-20 16:36:44 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 76 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-20 16:36:44 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_43ed24d8'), local_subscribe_addr='ipc:///tmp/6f290814-6fe8-4351-826f-8d677ee04475', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 08-20 16:36:47 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:47 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:48 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:48 [__init__.py:244] Automatically detected platform cuda.
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f900d182fc0>
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f81198e5760>
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f6be5664080>
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_fd6ff538'), local_subscribe_addr='ipc:///tmp/105e1c68-5c59-428f-96ac-685cafb6003f', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8b2f14e7'), local_subscribe_addr='ipc:///tmp/eb0e00e6-1fcc-4730-8bdc-a812269fddd6', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_cac17c74'), local_subscribe_addr='ipc:///tmp/90e5f820-91e6-4039-990d-e2f19fa4562b', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7efef0d0d610>
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_0b441e55'), local_subscribe_addr='ipc:///tmp/3b6b10f7-aad4-4876-97a5-6bca082fc456', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
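The log stops right after the pynccl lines above, i.e. during NCCL initialization and before any weight loading is reported. One diagnostic sketch (my own assumption, not something already tried in this log) would be to relaunch the same command with NCCL's own logging enabled to see where the ranks stall:

# Hypothetical diagnostic: surface NCCL init/net logging, then rerun the same vLLM command
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# If peer-to-peer transport between the L20s is suspected, disabling it is another quick test
# export NCCL_P2P_DISABLE=1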
Checking the GPUs: compute utilization is at 100%.
nvidia-smi
Wed Aug 20 16:48:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L20                     Off |   00000000:4F:00.0 Off |                    0 |
| N/A   34C    P0             91W /  350W |     495MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L20                     Off |   00000000:52:00.0 Off |                    0 |
| N/A   34C    P0             90W /  350W |     495MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L20                     Off |   00000000:56:00.0 Off |                    0 |
| N/A   34C    P0             94W /  350W |     495MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L20                     Off |   00000000:57:00.0 Off |                    0 |
| N/A   34C    P0             93W /  350W |     495MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off |   00000000:D1:00.0 Off |                  Off |
| 30%   28C    P8             16W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off |   00000000:D5:00.0 Off |                  Off |
| 30%   28C    P8              6W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off |   00000000:D6:00.0 Off |                  Off |
| 30%   30C    P8              7W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           55050      C   /usr/bin/python3                        486MiB |
|    1   N/A  N/A           55051      C   /usr/bin/python3                        486MiB |
|    2   N/A  N/A           55052      C   /usr/bin/python3                        486MiB |
|    3   N/A  N/A           55053      C   /usr/bin/python3                        486MiB |
+-----------------------------------------------------------------------------------------+
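Each worker holds only about 486MiB while reporting 100% utilization, which looks like the four processes are busy-waiting in initialization rather than loading weights. One way to confirm what they are doing (a sketch assuming py-spy is available, using the rank-0 PID from the process table above) is to dump the Python stack of a stuck worker:

# Hypothetical: inspect what worker PID 55050 (GPU 0) is currently executing
pip install py-spy
py-spy dump --pid 55050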