Starting a vLLM model service on four L20 GPUs
nvidia-smi
Wed Aug 20 16:45:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L20                     Off |   00000000:4F:00.0 Off |                    0 |
| N/A   30C    P8             38W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L20                     Off |   00000000:52:00.0 Off |                    0 |
| N/A   30C    P8             37W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L20                     Off |   00000000:56:00.0 Off |                    0 |
| N/A   29C    P8             38W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L20                     Off |   00000000:57:00.0 Off |                    0 |
| N/A   29C    P8             37W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off |   00000000:D1:00.0 Off |                  Off |
| 30%   27C    P8             16W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off |   00000000:D5:00.0 Off |                  Off |
| 30%   28C    P8              6W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off |   00000000:D6:00.0 Off |                  Off |
| 30%   29C    P8              8W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
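The box mixes four L20s (GPUs 0-3) with three RTX 4090s (GPUs 4-6). If one wants to be explicit about which four cards vLLM uses, pinning the L20s before launch is an option (an assumption on my part, not part of the launch command below):

# Hypothetical: restrict vLLM to the four L20s (GPU indices 0-3) so the 4090s are never picked up
export CUDA_VISIBLE_DEVICES=0,1,2,3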
The startup command is:
export VLLM_ATTENTION_BACKEND=FLASHINFER
nohup python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 \
  --port 8000 \
  --model /data-new/models/Qwen2.5-72B \
  --served-model-name /data-new/models/Qwen2.5-72B --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  -tp 4 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 60000 \
  --tool-call-parser hermes --enable-auto-tool-choice > vllm-$(date +%Y%m%d%H%M).log 2>&1 &
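For reference, once loading completes the server should expose an OpenAI-compatible API on port 8000, so a minimal sanity check against the served model name above would look like this (hypothetical requests; the server never got this far here):

# List registered models (should return /data-new/models/Qwen2.5-72B)
curl http://localhost:8000/v1/models

# Minimal chat completion against the served model name
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/data-new/models/Qwen2.5-72B", "messages": [{"role": "user", "content": "hello"}]}'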
Startup log (it hangs at the end):
INFO 08-20 16:36:24 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:28 [api_server.py:1287] vLLM API server version 0.9.1
INFO 08-20 16:36:28 [cli_args.py:309] non-default args: {'host': '0.0.0.0', 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/data-new/models/Qwen2.5-72B', 'trust_remote_code': True, 'rope_scaling': {'rope_type': 'yarn', 'factor': 4.0, 'original_max_position_embeddings': 32768}, 'max_model_len': 60000, 'served_model_name': ['/data-new/models/Qwen2.5-72B'], 'tensor_parallel_size': 4}
INFO 08-20 16:36:28 [config.py:533] Overriding HF config with {'rope_scaling': {'rope_type': 'yarn', 'factor': 4.0, 'original_max_position_embeddings': 32768}}
INFO 08-20 16:36:37 [config.py:823] This model supports multiple tasks: {'embed', 'score', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 08-20 16:36:37 [config.py:1946] Defaulting to use mp for distributed inference
INFO 08-20 16:36:37 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 08-20 16:36:40 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 08-20 16:36:41 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:44 [core.py:455] Waiting for init message from front-end.
INFO 08-20 16:36:44 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/data-new/models/Qwen2.5-72B', speculative_config=None, tokenizer='/data-new/models/Qwen2.5-72B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=60000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data-new/models/Qwen2.5-72B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 08-20 16:36:44 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 76 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-20 16:36:44 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_43ed24d8'), local_subscribe_addr='ipc:///tmp/6f290814-6fe8-4351-826f-8d677ee04475', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 08-20 16:36:45 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 08-20 16:36:47 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:47 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:48 [__init__.py:244] Automatically detected platform cuda.
INFO 08-20 16:36:48 [__init__.py:244] Automatically detected platform cuda.
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f900d182fc0>
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f81198e5760>
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f6be5664080>
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_fd6ff538'), local_subscribe_addr='ipc:///tmp/105e1c68-5c59-428f-96ac-685cafb6003f', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8b2f14e7'), local_subscribe_addr='ipc:///tmp/eb0e00e6-1fcc-4730-8bdc-a812269fddd6', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_cac17c74'), local_subscribe_addr='ipc:///tmp/90e5f820-91e6-4039-990d-e2f19fa4562b', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 08-20 16:36:51 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7efef0d0d610>
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_0b441e55'), local_subscribe_addr='ipc:///tmp/3b6b10f7-aad4-4876-97a5-6bca082fc456', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=248) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=246) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=247) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=245) INFO 08-20 16:36:51 [pynccl.py:70] vLLM is using nccl==2.26.2
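The log stops right after the pynccl lines above, i.e. during NCCL initialization and before any weight loading is reported. One diagnostic sketch (my own assumption, not something already tried in this log) would be to relaunch the same command with NCCL's own logging enabled to see where the ranks stall:

# Hypothetical diagnostic: surface NCCL init/net logging, then rerun the same vLLM command
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# If peer-to-peer transport between the L20s is suspected, disabling it is another quick test
# export NCCL_P2P_DISABLE=1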
Checking the GPUs: compute utilization is at 100%.
nvidia-smi
Wed Aug 20 16:48:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L20                     Off |   00000000:4F:00.0 Off |                    0 |
| N/A   34C    P0             91W /  350W |     495MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L20                     Off |   00000000:52:00.0 Off |                    0 |
| N/A   34C    P0             90W /  350W |     495MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L20                     Off |   00000000:56:00.0 Off |                    0 |
| N/A   34C    P0             94W /  350W |     495MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L20                     Off |   00000000:57:00.0 Off |                    0 |
| N/A   34C    P0             93W /  350W |     495MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off |   00000000:D1:00.0 Off |                  Off |
| 30%   28C    P8             16W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off |   00000000:D5:00.0 Off |                  Off |
| 30%   28C    P8              6W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off |   00000000:D6:00.0 Off |                  Off |
| 30%   30C    P8              7W /  450W |       0MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           55050      C   /usr/bin/python3                        486MiB |
|    1   N/A  N/A           55051      C   /usr/bin/python3                        486MiB |
|    2   N/A  N/A           55052      C   /usr/bin/python3                        486MiB |
|    3   N/A  N/A           55053      C   /usr/bin/python3                        486MiB |
+-----------------------------------------------------------------------------------------+
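Each worker holds only about 486MiB while reporting 100% utilization, which looks like the four processes are busy-waiting in initialization rather than loading weights. One way to confirm what they are doing (a sketch assuming py-spy is available, using the rank-0 PID from the process table above) is to dump the Python stack of a stuck worker:

# Hypothetical: inspect what worker PID 55050 (GPU 0) is currently executing
pip install py-spy
py-spy dump --pid 55050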