When starting vLLM, the log hangs at the NCCL-related lines and does not proceed further

Starting a vLLM model service on 4 L20 GPUs.

nvidia-smi

Wed Aug 20 16:45:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L20 Off | 00000000:4F:00.0 Off | 0 |
| N/A 30C P8 38W / 350W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L20 Off | 00000000:52:00.0 Off | 0 |
| N/A 30C P8 37W / 350W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L20 Off | 00000000:56:00.0 Off | 0 |
| N/A 29C P8 38W / 350W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L20 Off | 00000000:57:00.0 Off | 0 |
| N/A 29C P8 37W / 350W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 4090 Off | 00000000:D1:00.0 Off | Off |
| 30% 27C P8 16W / 450W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 4090 Off | 00000000:D5:00.0 Off | Off |
| 30% 28C P8 6W / 450W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 4090 Off | 00000000:D6:00.0 Off | Off |
| 30% 29C P8 8W / 450W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

The launch command is:

export VLLM_ATTENTION_BACKEND=FLASHINFER

nohup python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0
–port 8000
–model /data-new/models/Qwen2.5-72B
–served-model-name /data-new/models/Qwen2.5-72B --trust-remote-code
–gpu-memory-utilization 0.9
-tp 4
–rope-scaling ‘{“rope_type”:“yarn”,“factor”:4.0,“original_max_position_embeddings”:32768}’
–max-model-len 60000
–tool-call-parser hermes --enable-auto-tool-choice > vllm-$(date +%Y%m%d%H%M).log 2>&1 &
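
(For readability, this is presumably the same command with plain ASCII "--" flags and straight quotes; the forum rendering above has turned them into en-dashes and curly quotes.)

export VLLM_ATTENTION_BACKEND=FLASHINFER

nohup python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 \
  --port 8000 \
  --model /data-new/models/Qwen2.5-72B \
  --served-model-name /data-new/models/Qwen2.5-72B --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  -tp 4 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 60000 \
  --tool-call-parser hermes --enable-auto-tool-choice > vllm-$(date +%Y%m%d%H%M).log 2>&1 &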

Startup log (stuck at the end):

INFO 08-20 16:36:24 [init.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:28 [api_server.py:1287] vLLM API server version 0.9.1
INFO 08-20 16:36:28 [cli_args.py:309] non-default args: {‘host’: ‘0.0.0.0’, ‘enable_auto_tool_choice’: True, ‘tool_call_parser’: ‘hermes’, ‘model’: ‘/data-new/models/Qwen2.5-72B’, ‘trust_remote_code’: True, ‘rope_scaling’: {‘rope_type’: ‘yarn’, ‘factor’: 4
.0, ‘original_max_position_embeddings’: 32768}, ‘max_model_len’: 60000, ‘served_model_name’: [‘/data-new/models/Qwen2.5-72B’], ‘tensor_parallel_size’: 4}
INFO 08-20 16:36:28 [config.py:533] Overriding HF config with {‘rope_scaling’: {‘rope_type’: ‘yarn’, ‘factor’: 4.0, ‘original_max_position_embeddings’: 32768}}
INFO 08-20 16:36:37 [config.py:823] This model supports multiple tasks: {‘embed’, ‘score’, ‘reward’, ‘classify’, ‘generate’}. Defaulting to ‘generate’.
INFO 08-20 16:36:37 [config.py:1946] Defaulting to use mp for distributed inference
INFO 08-20 16:36:37 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 08-20 16:36:40 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
INFO 08-20 16:36:41 [init.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:44 [core.py:455] Waiting for init message from front-end.
INFO 08-20 16:36:44 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model=‘/data-new/models/Qwen2.5-72B’, speculative_config=None, tokenizer=‘/data-new/models/Qwen2.5-72B’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, over
ride_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=60000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization
=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=Obse
rvabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data-new/models/Qwen2.5-72B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunk
ed_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={“level”:3,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:[“none”],“splitting_ops”:[“vllm.unified_attention”,“vllm.unified_attention_with_output”],“u
se_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“use_cudagraph”:true,“cudagraph_num_of_warmups”:1,“cudagraph_capture_sizes”:[512,504,496,488,480,472,464,456,448,440,432,424,416,408
,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“max_ca
pture_size”:512,“local_cache_dir”:null}
WARNING 08-20 16:36:44 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 76 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-20 16:36:44 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, ‘psm_43ed24d8’), local_subscribe_addr=‘ipc:///tmp/6f290814-6fe8-4351-826f-8d677ee04475’, remote_subs
cribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
INFO 08-20 16:36:47 [init.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:47 [init.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:48 [init.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:48 [init.py:244] Automatically detected platform cuda.
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f900d182fc0>
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f81198e5760>
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f6be5664080>
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_fd6ff538’), local_subscribe_addr=‘ipc:///tmp/105e1c68-5c59-428f-9
6ac-685cafb6003f’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_8b2f14e7’), local_subscribe_addr=‘ipc:///tmp/eb0e00e6-1fcc-4730-8
bdc-a812269fddd6’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_cac17c74’), local_subscribe_addr=‘ipc:///tmp/90e5f820-91e6-4039-9
90d-e2f19fa4562b’, remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7efef0d0d610>
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_0b441e55’), local_subscribe_addr=‘ipc:///tmp/3b6b10f7-aad4-4876-9
7a5-6bca082fc456’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2

Checking the GPUs, compute utilization is at 100%:

nvidia-smi

Wed Aug 20 16:48:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L20 Off | 00000000:4F:00.0 Off | 0 |
| N/A 34C P0 91W / 350W | 495MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L20 Off | 00000000:52:00.0 Off | 0 |
| N/A 34C P0 90W / 350W | 495MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L20 Off | 00000000:56:00.0 Off | 0 |
| N/A 34C P0 94W / 350W | 495MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L20 Off | 00000000:57:00.0 Off | 0 |
| N/A 34C P0 93W / 350W | 495MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 4090 Off | 00000000:D1:00.0 Off | Off |
| 30% 28C P8 16W / 450W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 4090 Off | 00000000:D5:00.0 Off | Off |
| 30% 28C P8 6W / 450W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 4090 Off | 00000000:D6:00.0 Off | Off |
| 30% 30C P8 7W / 450W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 55050 C /usr/bin/python3 486MiB |
| 1 N/A N/A 55051 C /usr/bin/python3 486MiB |
| 2 N/A N/A 55052 C /usr/bin/python3 486MiB |
| 3 N/A N/A 55053 C /usr/bin/python3 486MiB |
+-----------------------------------------------------------------------------------------+

Based on the log you provided, vLLM has started and detected the 4 L20 GPUs; with tensor_parallel_size=4 the model is sharded across all four cards, the worker processes are assigned normally, each GPU holds about 495 MiB, and GPU utilization is 100%. This suggests that model loading and distributed-inference initialization are proceeding normally, with no errors or abnormal hang. The high GPU utilization you see is usually normal during model initialization or CUDA graph compilation; it is a warm-up phase that settles down afterwards, so you can simply wait for the server to start accepting inference requests.

If you later find that inference is slow or memory usage is abnormally low, check whether max_model_len, gpu_memory_utilization, and rope_scaling are set reasonably, or try lowering max_model_len to reduce memory pressure. Overall, the current configuration and runtime state follow the recommended practice for vLLM multi-GPU inference.

If you need a more detailed analysis or hit a concrete inference problem, please share the specific error or symptom.
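
Once the server does finish starting, a quick way to confirm it is actually serving is to hit the OpenAI-compatible HTTP endpoints. A minimal check, assuming the port 8000 from the launch command above:

curl http://localhost:8000/health      # should return HTTP 200 once the engine is up
curl http://localhost:8000/v1/models   # should list /data-new/models/Qwen2.5-72B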


The log is stuck here and does not continue. It should go on to print the service port / listening information; only then is the deployment complete.

Your understanding is correct: after the worker-initialization and NCCL messages, the vLLM startup log should continue with messages such as "Uvicorn running on ..." or "Application startup complete" before the deployment is really finished and the API service is reachable. If the log stalls at the NCCL / worker-initialization stage and the port-listening / API-startup messages never appear, the server has not fully started yet and is probably stuck in model loading, CUDA graph compilation, or NCCL initialization.

Suggestions:

  1. Check that the model files are complete and the path is correct, and make sure local disk I/O is not a bottleneck.
  2. Check GPU driver / CUDA / NCCL version compatibility; reboot the machine if necessary.
  3. Raise the log level (e.g. export VLLM_LOGGING_LEVEL=DEBUG), or watch memory and GPU utilization with nvidia-smi to tell whether it is stuck in model loading or NCCL initialization (see the sketch after this list).
  4. If there is no progress for a long time, try reducing max_model_len or tensor_parallel_size to rule out resource bottlenecks.
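
A minimal sketch of the debug setup from point 3, assuming you relaunch from the same shell (the 2-second interval is just an example):

export VLLM_LOGGING_LEVEL=DEBUG   # more verbose vLLM logs
# in a second terminal, watch memory and utilization while the server starts
watch -n 2 nvidia-smi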

Let me know if you need more detailed troubleshooting steps.


root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20# export VLLM_LOGGING_LEVEL=DEBUG
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20#
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20# python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0
–port 8000
–model /data-new/models/Qwen2.5-72B
–served-model-name /data-new/models/Qwen2.5-72B --trust-remote-code
–gpu-memory-utilization 0.9
-tp 4
DEBUG 08-20 17:11:31 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:11:31 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:11:31 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:11:31 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:31 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:11:31 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:11:31 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:11:31 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:11:31 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:11:31 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:11:31 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:11:31 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:11:31 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:11:31 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:31 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:11:31 [init.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:11:32 [utils.py:150] Setting VLLM_WORKER_MULTIPROC_METHOD to ‘spawn’
DEBUG 08-20 17:11:32 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:11:32 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:11:32 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-20 17:11:33 [api_server.py:1287] vLLM API server version 0.9.1
INFO 08-20 17:11:33 [cli_args.py:309] non-default args: {‘host’: ‘0.0.0.0’, ‘model’: ‘/data-new/models/Qwen2.5-72B’, ‘trust_remote_code’: True, ‘served_model_name’: [‘/data-new/models/Qwen2.5-72B’], ‘tensor_parallel_size’: 4}
INFO 08-20 17:11:39 [config.py:823] This model supports multiple tasks: {‘embed’, ‘generate’, ‘reward’, ‘score’, ‘classify’}. Defaulting to ‘generate’.
DEBUG 08-20 17:11:39 [arg_utils.py:1600] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
DEBUG 08-20 17:11:39 [arg_utils.py:1607] Setting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
INFO 08-20 17:11:40 [config.py:1946] Defaulting to use mp for distributed inference
INFO 08-20 17:11:40 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 08-20 17:11:42 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
DEBUG 08-20 17:11:44 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:11:44 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:11:44 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:11:44 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:45 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:11:45 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:11:45 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:11:45 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:11:45 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:11:45 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:11:45 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:11:45 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:11:45 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:11:45 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:45 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:11:45 [init.py:244] Automatically detected platform cuda.
INFO 08-20 17:11:47 [core.py:455] Waiting for init message from front-end.
DEBUG 08-20 17:11:47 [utils.py:547] HELLO from local core engine process 0.
DEBUG 08-20 17:11:47 [core.py:463] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=[‘ipc:///tmp/f85c2ed1-c67c-4a02-842b-015f14fb9d67’], outputs=[‘ipc:///tmp/7e0e9a93-0af4-427a-9ce1-071e2ce62a5f’], coordinator_input=None, coordinator_output=None), parallel_config={‘data_parallel_master_ip’: ‘127.0.0.1’, ‘data_parallel_master_port’: 0, ‘data_parallel_size’: 1})
DEBUG 08-20 17:11:47 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:11:47 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:11:47 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-20 17:11:47 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model=‘/data-new/models/Qwen2.5-72B’, speculative_config=None, tokenizer=‘/data-new/models/Qwen2.5-72B’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data-new/models/Qwen2.5-72B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={“level”:3,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:[“none”],“splitting_ops”:[“vllm.unified_attention”,“vllm.unified_attention_with_output”],“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“use_cudagraph”:true,“cudagraph_num_of_warmups”:1,“cudagraph_capture_sizes”:[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“max_capture_size”:512,“local_cache_dir”:null}
WARNING 08-20 17:11:47 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 76 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
DEBUG 08-20 17:11:47 [shm_broadcast.py:243] Binding to ipc:///tmp/66ab1609-f3f2-42b7-ad51-4223f25f81a0
INFO 08-20 17:11:47 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, ‘psm_3a74c7b1’), local_subscribe_addr=‘ipc:///tmp/66ab1609-f3f2-42b7-ad51-4223f25f81a0’, remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 17:11:48 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 17:11:48 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 17:11:48 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 17:11:48 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
DEBUG 08-20 17:11:50 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:11:50 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:11:50 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:11:50 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:11:50 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:11:50 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:11:50 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:11:50 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:11:50 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:11:50 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:11:50 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:11:50 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:11:50 [init.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:11:50 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:11:50 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:11:50 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:11:50 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:11:50 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:11:50 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:11:50 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:11:50 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:11:50 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:11:50 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:11:50 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:11:50 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:11:50 [init.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:11:50 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:11:50 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:11:50 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:11:50 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:11:50 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:11:50 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:11:50 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:11:50 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:11:50 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:11:50 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:11:50 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:11:50 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:11:50 [init.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:11:50 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:11:50 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:11:50 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:11:50 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:11:50 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:11:50 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:11:50 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:11:50 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:11:50 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:11:50 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:11:50 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:11:50 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:11:50 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:11:51 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:11:51 [init.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:11:52 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:11:52 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:11:52 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
DEBUG 08-20 17:11:53 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama.LlamaModel’>: [‘input_ids’, ‘positions’, ‘intermediate_tensors’, ‘inputs_embeds’]
DEBUG 08-20 17:11:53 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama_eagle3.LlamaModel’>: [‘input_ids’, ‘positions’, ‘hidden_states’]
WARNING 08-20 17:11:53 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fb93e06d760>
DEBUG 08-20 17:11:53 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:11:53 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=2 pid=582) DEBUG 08-20 17:11:53 [shm_broadcast.py:313] Connecting to ipc:///tmp/66ab1609-f3f2-42b7-ad51-4223f25f81a0
(VllmWorker rank=2 pid=582) DEBUG 08-20 17:11:53 [shm_broadcast.py:243] Binding to ipc:///tmp/803fe796-7b50-4ed3-b1df-7e7e6abb558d
(VllmWorker rank=2 pid=582) INFO 08-20 17:11:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_2ca1a7e7’), local_subscribe_addr=‘ipc:///tmp/803fe796-7b50-4ed3-b1df-7e7e6abb558d’, remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 08-20 17:11:53 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:11:53 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:11:53 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
DEBUG 08-20 17:11:53 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:11:53 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:11:53 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
DEBUG 08-20 17:11:53 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama.LlamaModel’>: [‘input_ids’, ‘positions’, ‘intermediate_tensors’, ‘inputs_embeds’]
DEBUG 08-20 17:11:53 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama_eagle3.LlamaModel’>: [‘input_ids’, ‘positions’, ‘hidden_states’]
DEBUG 08-20 17:11:53 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama.LlamaModel’>: [‘input_ids’, ‘positions’, ‘intermediate_tensors’, ‘inputs_embeds’]
DEBUG 08-20 17:11:53 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama_eagle3.LlamaModel’>: [‘input_ids’, ‘positions’, ‘hidden_states’]
(VllmWorker rank=2 pid=582) DEBUG 08-20 17:11:53 [parallel_state.py:918] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:39641 backend=nccl
WARNING 08-20 17:11:53 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ffab1ecee70>
DEBUG 08-20 17:11:53 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:11:53 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=0 pid=580) DEBUG 08-20 17:11:53 [shm_broadcast.py:313] Connecting to ipc:///tmp/66ab1609-f3f2-42b7-ad51-4223f25f81a0
(VllmWorker rank=0 pid=580) DEBUG 08-20 17:11:53 [shm_broadcast.py:243] Binding to ipc:///tmp/2751cfa5-a07a-4696-b5b9-cc26dedfaa88
(VllmWorker rank=0 pid=580) INFO 08-20 17:11:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_86752b18’), local_subscribe_addr=‘ipc:///tmp/2751cfa5-a07a-4696-b5b9-cc26dedfaa88’, remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 17:11:53 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f80998dbec0>
DEBUG 08-20 17:11:53 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:11:53 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=3 pid=583) DEBUG 08-20 17:11:53 [shm_broadcast.py:313] Connecting to ipc:///tmp/66ab1609-f3f2-42b7-ad51-4223f25f81a0
DEBUG 08-20 17:11:53 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:11:53 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:11:53 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
(VllmWorker rank=3 pid=583) DEBUG 08-20 17:11:53 [shm_broadcast.py:243] Binding to ipc:///tmp/7c48f4e9-75f4-49d7-ad10-c1dc40291d4d
(VllmWorker rank=3 pid=583) INFO 08-20 17:11:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_e7b220b6’), local_subscribe_addr=‘ipc:///tmp/7c48f4e9-75f4-49d7-ad10-c1dc40291d4d’, remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 08-20 17:11:53 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama.LlamaModel’>: [‘input_ids’, ‘positions’, ‘intermediate_tensors’, ‘inputs_embeds’]
DEBUG 08-20 17:11:53 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama_eagle3.LlamaModel’>: [‘input_ids’, ‘positions’, ‘hidden_states’]
WARNING 08-20 17:11:53 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fad21568b00>
DEBUG 08-20 17:11:53 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:11:53 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=1 pid=581) DEBUG 08-20 17:11:53 [shm_broadcast.py:313] Connecting to ipc:///tmp/66ab1609-f3f2-42b7-ad51-4223f25f81a0
(VllmWorker rank=0 pid=580) DEBUG 08-20 17:11:53 [parallel_state.py:918] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:39641 backend=nccl
(VllmWorker rank=1 pid=581) DEBUG 08-20 17:11:53 [shm_broadcast.py:243] Binding to ipc:///tmp/32902809-7db0-497d-a4f3-59cfa109bfcd
(VllmWorker rank=1 pid=581) INFO 08-20 17:11:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_e6d6d90e’), local_subscribe_addr=‘ipc:///tmp/32902809-7db0-497d-a4f3-59cfa109bfcd’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=583) DEBUG 08-20 17:11:53 [parallel_state.py:918] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:39641 backend=nccl
(VllmWorker rank=1 pid=581) DEBUG 08-20 17:11:54 [parallel_state.py:918] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:39641 backend=nccl
(VllmWorker rank=1 pid=581) INFO 08-20 17:11:54 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=581) INFO 08-20 17:11:54 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=583) INFO 08-20 17:11:54 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=583) INFO 08-20 17:11:54 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=582) INFO 08-20 17:11:54 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=580) INFO 08-20 17:11:54 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=582) INFO 08-20 17:11:54 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=580) INFO 08-20 17:11:54 [pynccl.py:70] vLLM is using nccl==2.26.2
DEBUG 08-20 17:11:57 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:12:07 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:12:17 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:12:27 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:12:37 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:12:47 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:12:57 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.

Your log shows vLLM stuck at the "Waiting for 1 local, 0 remote core engine proc(s) to start." step, printing that message repeatedly without ever reaching model-weight loading, port listening, or API startup. This means the backend EngineCore process did not start properly. Common causes:

  1. Invalid characters in the command line: the hyphens in front of --port, --model, etc. in your command appear to have been replaced with en-dash characters (–). They must all be plain ASCII hyphens (-); otherwise the arguments are not parsed correctly and the process can hang.
  2. Problems with the model path or weight files: a wrong model path or incomplete weight files will also prevent EngineCore from starting.
  3. Insufficient resources or NCCL initialization issues: for example, insufficient GPU memory or a misconfigured NCCL setup can also leave the processes stuck.

Suggestions:

  • Check and fix all command-line arguments so that every flag uses plain ASCII hyphens, e.g. --port 8000 --model ... (a quick check is sketched after this list).
  • Verify the model path and the integrity of the weight files.
  • If the problem persists, raise the log level further, or use export NCCL_DEBUG=INFO to get more NCCL output and narrow down NCCL initialization problems.
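
A quick way to check the first point, assuming the launch command is kept in a shell script (start_vllm.sh is a hypothetical name): list any non-ASCII characters, which would catch en-dashes and curly quotes:

# hypothetical script name; point it at wherever the launch command is stored
grep -nP '[^\x00-\x7F]' start_vllm.sh
# no output means the command uses only plain ASCII hyphens and quotes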

If further troubleshooting is needed, please share the rest of nohup.out or vllm-*.log.


root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20# export VLLM_LOGGING_LEVEL=DEBUG
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20#
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20# export NCCL_DEBUG=INFO
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20#
root@a49e15233991:/data-new/app/Qwen2.5-72B/0123/v091/L20# python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model /data-new/models/Qwen2.5-72B --served-model-name /data-new/models/Qwen2.5-72B --trust-remote-code --gpu-memory-utilization 0.9 -tp 4
DEBUG 08-20 17:17:44 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:17:44 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:17:44 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:17:44 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:17:44 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:17:44 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:17:44 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:17:44 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:17:44 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:17:44 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:17:44 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:17:44 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:17:44 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:17:44 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:17:44 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:17:44 [init.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:17:45 [utils.py:150] Setting VLLM_WORKER_MULTIPROC_METHOD to ‘spawn’
DEBUG 08-20 17:17:45 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:17:45 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:17:45 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-20 17:17:46 [api_server.py:1287] vLLM API server version 0.9.1
INFO 08-20 17:17:46 [cli_args.py:309] non-default args: {‘host’: ‘0.0.0.0’, ‘model’: ‘/data-new/models/Qwen2.5-72B’, ‘trust_remote_code’: True, ‘served_model_name’: [‘/data-new/models/Qwen2.5-72B’], ‘tensor_parallel_size’: 4}
INFO 08-20 17:17:53 [config.py:823] This model supports multiple tasks: {‘embed’, ‘generate’, ‘score’, ‘classify’, ‘reward’}. Defaulting to ‘generate’.
DEBUG 08-20 17:17:53 [arg_utils.py:1600] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
DEBUG 08-20 17:17:53 [arg_utils.py:1607] Setting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
INFO 08-20 17:17:53 [config.py:1946] Defaulting to use mp for distributed inference
INFO 08-20 17:17:53 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 08-20 17:17:56 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
DEBUG 08-20 17:17:57 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:17:57 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:17:57 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:17:57 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:17:57 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:17:57 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:17:57 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:17:57 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:17:57 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:17:57 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:17:57 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:17:57 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:17:57 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:17:57 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:17:57 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:17:57 [init.py:244] Automatically detected platform cuda.
INFO 08-20 17:18:00 [core.py:455] Waiting for init message from front-end.
DEBUG 08-20 17:18:00 [utils.py:547] HELLO from local core engine process 0.
DEBUG 08-20 17:18:00 [core.py:463] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=[‘ipc:///tmp/5d49e0ce-dfe9-49b2-a5f4-fe250cc30826’], outputs=[‘ipc:///tmp/66beb4ef-4061-4daa-b693-7e3fb030334e’], coordinator_input=None, coordinator_output=None), parallel_config={‘data_parallel_master_ip’: ‘127.0.0.1’, ‘data_parallel_master_port’: 0, ‘data_parallel_size’: 1})
DEBUG 08-20 17:18:00 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:18:00 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:18:00 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-20 17:18:00 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model=‘/data-new/models/Qwen2.5-72B’, speculative_config=None, tokenizer=‘/data-new/models/Qwen2.5-72B’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data-new/models/Qwen2.5-72B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={“level”:3,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:[“none”],“splitting_ops”:[“vllm.unified_attention”,“vllm.unified_attention_with_output”],“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“use_cudagraph”:true,“cudagraph_num_of_warmups”:1,“cudagraph_capture_sizes”:[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“max_capture_size”:512,“local_cache_dir”:null}
WARNING 08-20 17:18:00 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 76 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
DEBUG 08-20 17:18:00 [shm_broadcast.py:243] Binding to ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410
INFO 08-20 17:18:00 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, ‘psm_8fd26fcc’), local_subscribe_addr=‘ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410’, remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 17:18:01 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 17:18:01 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 17:18:01 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
WARNING 08-20 17:18:01 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
DEBUG 08-20 17:18:03 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:18:03 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:18:03 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:18:03 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:18:03 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:18:03 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:18:03 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’
DEBUG 08-20 17:18:03 [init.py:121] Checking if HPU platform is available.
DEBUG 08-20 17:18:03 [init.py:128] HPU platform is not available because habana_frameworks is not found.
DEBUG 08-20 17:18:03 [init.py:138] Checking if XPU platform is available.
DEBUG 08-20 17:18:03 [init.py:148] XPU platform is not available because: No module named ‘intel_extension_for_pytorch’
DEBUG 08-20 17:18:03 [init.py:155] Checking if CPU platform is available.
DEBUG 08-20 17:18:03 [init.py:177] Checking if Neuron platform is available.
DEBUG 08-20 17:18:03 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:18:03 [init.py:72] Confirmed CUDA platform is available.
INFO 08-20 17:18:03 [init.py:244] Automatically detected platform cuda.
DEBUG 08-20 17:18:03 [init.py:31] No plugins for group vllm.platform_plugins found.
DEBUG 08-20 17:18:03 [init.py:35] Checking if TPU platform is available.
DEBUG 08-20 17:18:03 [init.py:45] TPU platform is not available because: No module named ‘libtpu’
DEBUG 08-20 17:18:03 [init.py:52] Checking if CUDA platform is available.
DEBUG 08-20 17:18:03 [init.py:72] Confirmed CUDA platform is available.
DEBUG 08-20 17:18:03 [init.py:100] Checking if ROCm platform is available.
DEBUG 08-20 17:18:03 [init.py:114] ROCm platform is not available because: No module named ‘amdsmi’

WARNING 08-20 17:18:06 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f23686a0050>
DEBUG 08-20 17:18:06 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:18:06 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=0 pid=914) DEBUG 08-20 17:18:06 [shm_broadcast.py:313] Connecting to ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410
(VllmWorker rank=1 pid=915) DEBUG 08-20 17:18:06 [parallel_state.py:918] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:56771 backend=nccl
(VllmWorker rank=0 pid=914) DEBUG 08-20 17:18:06 [shm_broadcast.py:243] Binding to ipc:///tmp/e3ed3031-b88c-4940-9e31-f54185abae0f
(VllmWorker rank=0 pid=914) INFO 08-20 17:18:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_1dc59dc6’), local_subscribe_addr=‘ipc:///tmp/e3ed3031-b88c-4940-9e31-f54185abae0f’, remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 08-20 17:18:06 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama.LlamaModel’>: [‘input_ids’, ‘positions’, ‘intermediate_tensors’, ‘inputs_embeds’]
DEBUG 08-20 17:18:06 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama_eagle3.LlamaModel’>: [‘input_ids’, ‘positions’, ‘hidden_states’]
WARNING 08-20 17:18:06 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f421b5e73b0>
DEBUG 08-20 17:18:06 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:18:06 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=2 pid=916) DEBUG 08-20 17:18:06 [shm_broadcast.py:313] Connecting to ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410
(VllmWorker rank=2 pid=916) DEBUG 08-20 17:18:06 [shm_broadcast.py:243] Binding to ipc:///tmp/e26551fd-33d3-48d0-b78a-a58e8757c0c8
(VllmWorker rank=2 pid=916) INFO 08-20 17:18:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_bf4a9042’), local_subscribe_addr=‘ipc:///tmp/e26551fd-33d3-48d0-b78a-a58e8757c0c8’, remote_subscribe_addr=None, remote_addr_ipv6=False)
DEBUG 08-20 17:18:06 [init.py:39] Available plugins for group vllm.general_plugins:
DEBUG 08-20 17:18:06 [init.py:41] - lora_filesystem_resolver → vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 08-20 17:18:06 [init.py:44] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
DEBUG 08-20 17:18:06 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama.LlamaModel’>: [‘input_ids’, ‘positions’, ‘intermediate_tensors’, ‘inputs_embeds’]
DEBUG 08-20 17:18:06 [decorators.py:110] Inferred dynamic dimensions for forward method of <class ‘vllm.model_executor.models.llama_eagle3.LlamaModel’>: [‘input_ids’, ‘positions’, ‘hidden_states’]
(VllmWorker rank=0 pid=914) DEBUG 08-20 17:18:06 [parallel_state.py:918] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:56771 backend=nccl
WARNING 08-20 17:18:06 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9d17a2d760>
DEBUG 08-20 17:18:06 [config.py:4677] enabled custom ops: Counter()
DEBUG 08-20 17:18:06 [config.py:4679] disabled custom ops: Counter()
(VllmWorker rank=3 pid=917) DEBUG 08-20 17:18:06 [shm_broadcast.py:313] Connecting to ipc:///tmp/4bf3eb16-665f-41f4-a101-8835fa259410
(VllmWorker rank=3 pid=917) DEBUG 08-20 17:18:06 [shm_broadcast.py:243] Binding to ipc:///tmp/aa38f861-f5b5-490c-98d2-8494edb22415
(VllmWorker rank=3 pid=917) INFO 08-20 17:18:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, ‘psm_51d13c38’), local_subscribe_addr=‘ipc:///tmp/aa38f861-f5b5-490c-98d2-8494edb22415’, remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=916) DEBUG 08-20 17:18:06 [parallel_state.py:918] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:56771 backend=nccl
(VllmWorker rank=3 pid=917) DEBUG 08-20 17:18:07 [parallel_state.py:918] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:56771 backend=nccl
(VllmWorker rank=2 pid=916) INFO 08-20 17:18:07 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=916) INFO 08-20 17:18:07 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=917) INFO 08-20 17:18:07 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=917) INFO 08-20 17:18:07 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=915) INFO 08-20 17:18:07 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=915) INFO 08-20 17:18:07 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=914) INFO 08-20 17:18:07 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=914) INFO 08-20 17:18:07 [pynccl.py:70] vLLM is using nccl==2.26.2
a49e15233991:914:914 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
a49e15233991:914:914 [0] NCCL INFO cudaDriverVersion 12080
a49e15233991:914:914 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
a49e15233991:916:916 [2] NCCL INFO cudaDriverVersion 12080
a49e15233991:916:916 [2] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
a49e15233991:916:916 [2] NCCL INFO NCCL version 2.26.2+cuda12.2
a49e15233991:917:917 [3] NCCL INFO cudaDriverVersion 12080
a49e15233991:917:917 [3] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
a49e15233991:917:917 [3] NCCL INFO NCCL version 2.26.2+cuda12.2
a49e15233991:915:915 [1] NCCL INFO cudaDriverVersion 12080
a49e15233991:915:915 [1] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
a49e15233991:915:915 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
a49e15233991:914:914 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
a49e15233991:914:914 [0] NCCL INFO NET/IB : No device found.
a49e15233991:914:914 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
a49e15233991:914:914 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
a49e15233991:914:914 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
a49e15233991:914:914 [0] NCCL INFO Using network Socket
a49e15233991:914:914 [0] NCCL INFO ncclCommInitRank comm 0xd878d20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4f000 commId 0x5d717172769c2522 - Init START
a49e15233991:917:917 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
a49e15233991:916:916 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
a49e15233991:917:917 [3] NCCL INFO NET/IB : No device found.
a49e15233991:917:917 [3] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
a49e15233991:917:917 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
a49e15233991:916:916 [2] NCCL INFO NET/IB : No device found.
a49e15233991:916:916 [2] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
a49e15233991:917:917 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
a49e15233991:917:917 [3] NCCL INFO Using network Socket
a49e15233991:916:916 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
a49e15233991:916:916 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
a49e15233991:916:916 [2] NCCL INFO Using network Socket
a49e15233991:917:917 [3] NCCL INFO ncclCommInitRank comm 0xe44f7d0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 57000 commId 0x5d717172769c2522 - Init START
a49e15233991:916:916 [2] NCCL INFO ncclCommInitRank comm 0xd66b740 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 56000 commId 0x5d717172769c2522 - Init START
a49e15233991:917:917 [3] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
a49e15233991:915:915 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
a49e15233991:915:915 [1] NCCL INFO NET/IB : No device found.
a49e15233991:915:915 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
a49e15233991:915:915 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
a49e15233991:915:915 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
a49e15233991:915:915 [1] NCCL INFO Using network Socket
a49e15233991:915:915 [1] NCCL INFO ncclCommInitRank comm 0xe33aaf0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 52000 commId 0x5d717172769c2522 - Init START
a49e15233991:914:914 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
a49e15233991:915:915 [1] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
a49e15233991:916:916 [2] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
a49e15233991:916:916 [2] NCCL INFO Bootstrap timings total 0.048889 (create 0.000016, send 0.000055, recv 0.000041, ring 0.000035, delay 0.000000)
a49e15233991:915:915 [1] NCCL INFO Bootstrap timings total 0.002144 (create 0.000020, send 0.000069, recv 0.000270, ring 0.001528, delay 0.000000)
a49e15233991:917:917 [3] NCCL INFO Bootstrap timings total 0.049082 (create 0.000017, send 0.000053, recv 0.000067, ring 0.048548, delay 0.000000)
a49e15233991:916:916 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a49e15233991:914:914 [0] NCCL INFO Bootstrap timings total 0.065509 (create 0.000022, send 0.000062, recv 0.063531, ring 0.001624, delay 0.000000)
a49e15233991:915:915 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a49e15233991:917:917 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a49e15233991:914:914 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a49e15233991:914:914 [0] NCCL INFO Setting affinity for GPU 0 to 03ffff,fffff000,0000003f,ffffffff
a49e15233991:914:914 [0] NCCL INFO NVLS multicast support is not available on dev 0
a49e15233991:917:917 [3] NCCL INFO Setting affinity for GPU 3 to 03ffff,fffff000,0000003f,ffffffff
a49e15233991:917:917 [3] NCCL INFO NVLS multicast support is not available on dev 3
a49e15233991:916:916 [2] NCCL INFO Setting affinity for GPU 2 to 03ffff,fffff000,0000003f,ffffffff
a49e15233991:916:916 [2] NCCL INFO NVLS multicast support is not available on dev 2
a49e15233991:915:915 [1] NCCL INFO Setting affinity for GPU 1 to 03ffff,fffff000,0000003f,ffffffff
a49e15233991:915:915 [1] NCCL INFO NVLS multicast support is not available on dev 1
a49e15233991:917:917 [3] NCCL INFO comm 0xe44f7d0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
a49e15233991:915:915 [1] NCCL INFO comm 0xe33aaf0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
a49e15233991:914:914 [0] NCCL INFO comm 0xd878d20 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
a49e15233991:916:916 [2] NCCL INFO comm 0xd66b740 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
a49e15233991:917:917 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2
a49e15233991:915:915 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
a49e15233991:914:914 [0] NCCL INFO Channel 00/04 : 0 1 2 3
a49e15233991:916:916 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
a49e15233991:917:917 [3] NCCL INFO P2P Chunksize set to 131072
a49e15233991:915:915 [1] NCCL INFO P2P Chunksize set to 131072
a49e15233991:914:914 [0] NCCL INFO Channel 01/04 : 0 1 2 3
a49e15233991:916:916 [2] NCCL INFO P2P Chunksize set to 131072
a49e15233991:914:914 [0] NCCL INFO Channel 02/04 : 0 1 2 3
a49e15233991:914:914 [0] NCCL INFO Channel 03/04 : 0 1 2 3
a49e15233991:914:914 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
a49e15233991:914:914 [0] NCCL INFO P2P Chunksize set to 131072
a49e15233991:914:914 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0
a49e15233991:916:1018 [2] NCCL INFO [Proxy Service] Device 2 CPU core 2
a49e15233991:915:1022 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 113
a49e15233991:916:1021 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 18
a49e15233991:914:1019 [0] NCCL INFO [Proxy Service] Device 0 CPU core 30
a49e15233991:915:1017 [1] NCCL INFO [Proxy Service] Device 1 CPU core 111
a49e15233991:914:1023 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 19
a49e15233991:917:1016 [3] NCCL INFO [Proxy Service] Device 3 CPU core 10
a49e15233991:917:1020 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 91
a49e15233991:917:917 [3] NCCL INFO Channel 00/0 : 3[3] → 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 00/0 : 2[2] → 3[3] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 00/0 : 1[1] → 2[2] via P2P/IPC
a49e15233991:914:914 [0] NCCL INFO Channel 00/0 : 0[0] → 1[1] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 01/0 : 3[3] → 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 01/0 : 2[2] → 3[3] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 01/0 : 1[1] → 2[2] via P2P/IPC
a49e15233991:914:914 [0] NCCL INFO Channel 01/0 : 0[0] → 1[1] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 02/0 : 3[3] → 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 02/0 : 2[2] → 3[3] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 02/0 : 1[1] → 2[2] via P2P/IPC
a49e15233991:914:914 [0] NCCL INFO Channel 02/0 : 0[0] → 1[1] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 03/0 : 3[3] → 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 03/0 : 2[2] → 3[3] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 03/0 : 1[1] → 2[2] via P2P/IPC
a49e15233991:914:914 [0] NCCL INFO Channel 03/0 : 0[0] → 1[1] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
a49e15233991:916:916 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
a49e15233991:914:914 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
a49e15233991:915:915 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
a49e15233991:917:917 [3] NCCL INFO Channel 00/0 : 3[3] → 2[2] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 01/0 : 3[3] → 2[2] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 02/0 : 3[3] → 2[2] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Channel 03/0 : 3[3] → 2[2] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 00/0 : 2[2] → 1[1] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 00/0 : 1[1] → 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 01/0 : 2[2] → 1[1] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 01/0 : 1[1] → 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 02/0 : 2[2] → 1[1] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 02/0 : 1[1] → 0[0] via P2P/IPC
a49e15233991:916:916 [2] NCCL INFO Channel 03/0 : 2[2] → 1[1] via P2P/IPC
a49e15233991:915:915 [1] NCCL INFO Channel 03/0 : 1[1] → 0[0] via P2P/IPC
a49e15233991:917:917 [3] NCCL INFO Connected all trees
a49e15233991:914:914 [0] NCCL INFO Connected all trees
a49e15233991:916:916 [2] NCCL INFO Connected all trees
a49e15233991:915:915 [1] NCCL INFO Connected all trees
a49e15233991:916:1024 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 83
a49e15233991:916:916 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
a49e15233991:916:916 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
a49e15233991:915:1025 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 87
a49e15233991:915:915 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
a49e15233991:915:915 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
a49e15233991:917:1026 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 96
a49e15233991:917:917 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
a49e15233991:917:917 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
a49e15233991:914:1027 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 30
a49e15233991:914:914 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
a49e15233991:914:914 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
a49e15233991:914:914 [0] NCCL INFO CC Off, workFifoBytes 1048576
a49e15233991:917:917 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
a49e15233991:915:915 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
a49e15233991:917:917 [3] NCCL INFO ncclCommInitRank comm 0xe44f7d0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 57000 commId 0x5d717172769c2522 - Init COMPLETE
a49e15233991:915:915 [1] NCCL INFO ncclCommInitRank comm 0xe33aaf0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 52000 commId 0x5d717172769c2522 - Init COMPLETE
a49e15233991:917:917 [3] NCCL INFO Init timings - ncclCommInitRank: rank 3 nranks 4 total 0.69 (kernels 0.55, alloc 0.00, bootstrap 0.05, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.08, rest 0.00)
a49e15233991:915:915 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 4 total 0.69 (kernels 0.60, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.07, rest 0.00)
a49e15233991:914:914 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
a49e15233991:914:914 [0] NCCL INFO ncclCommInitRank comm 0xd878d20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4f000 commId 0x5d717172769c2522 - Init COMPLETE
a49e15233991:914:914 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 4 total 0.72 (kernels 0.55, alloc 0.00, bootstrap 0.07, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.08, rest 0.02)
a49e15233991:916:916 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
a49e15233991:916:916 [2] NCCL INFO ncclCommInitRank comm 0xd66b740 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 56000 commId 0x5d717172769c2522 - Init COMPLETE
a49e15233991:916:916 [2] NCCL INFO Init timings - ncclCommInitRank: rank 2 nranks 4 total 0.71 (kernels 0.56, alloc 0.00, bootstrap 0.05, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.07, rest 0.02)
DEBUG 08-20 17:18:10 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:18:20 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:18:30 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:18:40 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 08-20 17:18:50 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.

Your NCCL log shows that multi-GPU communicator initialization (ncclCommInitRank) completed on every rank with no NCCL errors; the network and P2P channels were all established, so the NCCL stage itself is fine. vLLM is nevertheless stuck at "Waiting for 1 local, 0 remote core engine proc(s) to start." and never reaches weight loading or the stage where the API port starts listening. Common causes are:

  1. Model weights too large or incomplete: Qwen2.5-72B is very large; if the weight files are incomplete or corrupted, or disk I/O is slow, the process can stall for a long time. Verify that every weight file under /data-new/models/Qwen2.5-72B is present and that disk I/O is healthy. (reference)
  2. Insufficient CPU or GPU memory: a 72B model needs a lot of both, and running out of either can freeze the startup. Monitor free -h and nvidia-smi to confirm there is enough headroom.
  3. /dev/shm too small in the container: vLLM multi-GPU inference relies on shared memory, and the container default /dev/shm may be too small. Start the container with --shm-size=16G or larger. (reference)
  4. Model path permissions or format problems: make sure the model path is readable, the weights are in safetensors or bin format, and the transformers version is compatible.
  5. vLLM / CUDA / NCCL compatibility: vLLM 0.9.1 + CUDA 12.8 + NCCL 2.26.2 should be compatible in theory, but consider upgrading vLLM to the latest release or trying PyTorch 2.3+.

Troubleshooting checklist:

  • Check the number and integrity of the weight files under /data-new/models/Qwen2.5-72B.
  • Monitor CPU memory, GPU memory, and /dev/shm usage.
  • If running in a container, restart it with a larger --shm-size.
  • If it still hangs, try a smaller model (e.g. Qwen2.5-7B) to rule out environment problems.
  • Check nohup.log or vllm-*.log for more detailed errors.

For further diagnosis, set export VLLM_LOGGING_LEVEL=DEBUG and share the later part of the log; a quick diagnostic sketch follows below.
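
For reference, a minimal diagnostic sketch (plain shell; the model path is taken from the launch command above, everything else is generic):

# check that all weight shards are present and readable
ls -lh /data-new/models/Qwen2.5-72B/
# watch CPU memory, shared memory and GPU memory while the server starts
free -h
df -h /dev/shm
nvidia-smi
# relaunch with verbose logging so the hang point shows up in the log
export VLLM_LOGGING_LEVEL=DEBUG
export NCCL_DEBUG=INFO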

Sources:

[Bug]: vLLM hangs forever on waiting engine process to start · Issue #17676 · vllm-project/vllm · GitHub

Following the workaround in that issue, export NCCL_P2P_DISABLE=1 let me bring the service up on the L20s. But on this same server, I do not need that environment variable when I start a service on the 4090 cards. What does this variable do, what is its impact, and why do the L20s need it?

export NCCL_P2P_DISABLE=1 disables NCCL's GPU-to-GPU peer-to-peer (P2P) communication and forces inter-GPU data to be staged through CPU memory instead of going over a direct GPU-to-GPU path. This works around hangs caused by hardware, driver, or topology combinations where P2P is unsupported or NCCL's P2P initialization misbehaves. (official documentation)

Data-center cards like the L20 usually have no NVLink and sit behind a more complex PCIe topology, so their P2P support can be limited or unstable and NCCL may hang during initialization; that is why the variable is needed. The 4090 desktop cards on the same board have a simpler PCIe topology and better-behaved P2P, so NCCL can detect and initialize them normally without disabling P2P. (related discussion)

Impact: with P2P disabled, inter-GPU bandwidth drops and multi-GPU inference loses some performance (typically around 5-10%), but the service starts reliably. Use it only when you hit NCCL hangs or a stuck startup. A quick way to check what P2P actually looks like on this host is sketched below.
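
As an illustration (assuming PyTorch is installed in the same environment as vLLM; the one-liner is just a convenience check, not part of vLLM), you can ask the CUDA runtime whether each GPU pair can access the other's memory directly:

# prints True/False for every ordered GPU pair
python3 -c "import torch; n = torch.cuda.device_count(); [print(i, j, torch.cuda.can_device_access_peer(i, j)) for i in range(n) for j in range(n) if i != j]"

A False for any pair means direct P2P is not available between those two GPUs and NCCL has to use another path for them.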

If the server only has 4090 cards, vLLM starts fine without export NCCL_P2P_DISABLE=1. But if the server also contains other card models (such as L20), then even when I pin the device IDs so that only the 4090s are used (the same cards as before), I have to add export NCCL_P2P_DISABLE=1, otherwise it hangs at the NCCL stage. Why is that?

Yes. When a server mixes 4090s with other models such as the L20, it is still advisable to add export NCCL_P2P_DISABLE=1 even if you only expose the 4090s via CUDA_VISIBLE_DEVICES. Otherwise NCCL may scan all physical GPUs during initialization and try to set up P2P channels, and hitting an incompatible card (such as the L20) can leave it stuck in the NCCL init stage. (references)

The reason is that NCCL's P2P capability detection covers all GPUs in the system, not only the ones you selected. As long as a P2P-incompatible card exists anywhere in the machine, NCCL initialization can run into trouble. Adding NCCL_P2P_DISABLE=1 avoids this and lets the service start reliably even when only the 4090s are used. (official documentation) A launch restricted to the 4090s on this mixed host is sketched below.
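
For illustration only (GPU indices 4 and 5 are two of the 4090s reported by nvidia-smi on this host; the port, model path and -tp value are placeholders, not taken from the original command):

# expose only the 4090s to this process
export CUDA_VISIBLE_DEVICES=4,5
# still disable P2P because other card models are present in the same machine
export NCCL_P2P_DISABLE=1
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8001 \
    --model <path-to-a-smaller-model> -tp 2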

How should I understand "incompatible cards"? What I see now is that as long as the server contains GPUs of different models, I have to add export NCCL_P2P_DISABLE=1 no matter which cards I use.

A "P2P-incompatible card" means: on the same server, when NCCL tries to set up direct GPU-to-GPU (peer-to-peer) communication between cards of different models (e.g. 4090 and L20) or different architectures, the hardware, driver, or board topology does not support it, and NCCL initialization hangs or errors out. (official documentation)

During initialization NCCL probes the P2P capability of all physical GPUs, not only the ones you specified. As long as one card in the system is P2P-incompatible with the others (for example an L20 alongside 4090s), you may need export NCCL_P2P_DISABLE=1 even for a 4090-only launch, because NCCL's lower layers scan globally and try to establish P2P channels between all cards. (related discussion) You can inspect the connectivity the driver reports, as shown below.
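
To see how the cards are wired together, dump the topology matrix (nvidia-smi topo -m is standard; the -p2p option is an extra, optional check that may depend on your driver version):

# how each GPU pair is connected (NVLink, same PCIe switch, across CPU sockets, ...)
nvidia-smi topo -m
# per-capability P2P status matrix, e.g. read capability, if your driver supports it
nvidia-smi topo -p2p r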

These same cards (two different models) were moved to another server, and there the vLLM service runs fine without export NCCL_P2P_DISABLE=1. What else does P2P depend on?

P2P (peer-to-peer) communication means multiple GPUs exchange data directly over a fast path (such as NVLink or PCIe) without going through CPU memory. NCCL detects and enables P2P by default to improve multi-GPU communication efficiency. (official documentation)

If the other server also mixes GPU models yet vLLM starts and infers normally there, then its hardware topology, driver, or board design lets NCCL either route around the incompatible P2P paths automatically or identify and handle every pair's P2P capability correctly. In that case NCCL can fall back to CPU staging, or enable P2P only between compatible cards, and it does not hang even without NCCL_P2P_DISABLE=1. (related discussion)

In short: P2P only affects multi-GPU communication performance and NCCL initialization stability, not single-GPU inference. The different behavior across servers comes down to how well each machine's hardware, driver, and motherboard support P2P and how NCCL handles that topology. A simple way to compare the two machines is sketched below.
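
A minimal comparison sketch (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL settings; the output file name is arbitrary):

# record the topology each machine's driver reports, then diff the two files
nvidia-smi topo -m > topo_$(hostname).txt
# make NCCL log its init and topology/graph decisions into the vLLM log
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
# then start vLLM the same way on both machines and compare the "NCCL INFO" lines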
