Why won't vllm serve start even though there is enough VRAM?

My hardware: H20 (95 GB) × 4. Launch command:

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/ \
  --served-model-name Qwen3-235B-A22B \
  --port 6668 \
  --host 0.0.0.0 \
  --tensor-parallel-size 4 \
  --distributed-executor-backend mp \
  --disable-custom-all-reduce \
  --dtype half \
  --trust-remote-code \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --disable-log-requests \
  --disable-log-stats \
  --load-format auto

Error log

(SenseRL) root@5a80c4dad3b0:/# CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/ \--served-model-name Qwen3-235B-A22B \--port 6668 \--host 0.0.0.0 \--tensor-parallel-size 4 \--distributed-executor-backend mp \--disable-custom-all-reduce \--dtype half \--trust-remote-code \--enable-chunked-prefill \--enable-prefix-caching \--disable-log-requests \--disable-log-stats \--load-format auto
INFO 07-30 15:06:56 [__init__.py:243] Automatically detected platform cuda.
INFO 07-30 15:07:00 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 07-30 15:07:00 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 07-30 15:07:00 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 07-30 15:07:02 [api_server.py:1289] vLLM API server version 0.9.0
INFO 07-30 15:07:03 [cli_args.py:300] non-default args: {'host': '0.0.0.0', 'port': 6668, 'trust_remote_code': True, 'dtype': 'half', 'served_model_name': ['Qwen3-235B-A22B'], 'distributed_executor_backend': 'mp', 'tensor_parallel_size': 4, 'disable_custom_all_reduce': True, 'enable_prefix_caching': True, 'enable_chunked_prefill': True, 'disable_log_stats': True, 'disable_log_requests': True}
WARNING 07-30 15:07:03 [config.py:3135] Casting torch.bfloat16 to torch.float16.
INFO 07-30 15:07:10 [config.py:793] This model supports multiple tasks: {'generate', 'classify', 'reward', 'embed', 'score'}. Defaulting to 'generate'.
INFO 07-30 15:07:10 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-30 15:07:14 [__init__.py:243] Automatically detected platform cuda.
INFO 07-30 15:07:17 [core.py:438] Waiting for init message from front-end.
INFO 07-30 15:07:17 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 07-30 15:07:17 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 07-30 15:07:17 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 07-30 15:07:17 [core.py:65] Initializing a V1 LLM engine (v0.9.0) with config: model='/mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/', speculative_config=None, tokenizer='/mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-235B-A22B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level": 3, "custom_ops": ["none"], "splitting_ops": ["vllm.unified_attention", "vllm.unified_attention_with_output"], "compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "use_cudagraph": true, "cudagraph_num_of_warmups": 1, "cudagraph_capture_sizes": [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 512}
WARNING 07-30 15:07:17 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-30 15:07:17 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 10485760, 10, 'psm_1bced2b8'), local_subscribe_addr='ipc:///tmp/a61ae567-695a-403b-bc31-553ebcd8d26f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-30 15:07:20 [__init__.py:243] Automatically detected platform cuda.
INFO 07-30 15:07:20 [__init__.py:243] Automatically detected platform cuda.
INFO 07-30 15:07:20 [__init__.py:243] Automatically detected platform cuda.
INFO 07-30 15:07:21 [__init__.py:243] Automatically detected platform cuda.
INFO 07-30 15:07:25 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 07-30 15:07:25 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 07-30 15:07:25 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
WARNING 07-30 15:07:25 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fb8dd2b1910>
(VllmWorker rank=0 pid=445618) INFO 07-30 15:07:25 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_3cbe2f01'), local_subscribe_addr='ipc:///tmp/719d0b82-f680-42d7-af52-c2787a3027a2', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-30 15:07:26 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 07-30 15:07:26 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 07-30 15:07:26 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 07-30 15:07:26 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 07-30 15:07:26 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 07-30 15:07:26 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 07-30 15:07:26 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 07-30 15:07:26 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 07-30 15:07:26 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
WARNING 07-30 15:07:26 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fddb381cb10>
(VllmWorker rank=3 pid=445621) INFO 07-30 15:07:26 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a60dcafd'), local_subscribe_addr='ipc:///tmp/5630b57d-c07e-4ae6-887c-d6461888d04e', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 07-30 15:07:26 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fe739aa7e10>
(VllmWorker rank=1 pid=445619) INFO 07-30 15:07:26 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8b7998e7'), local_subscribe_addr='ipc:///tmp/5d43fe4e-8db0-4e3d-b30a-76ef39676483', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 07-30 15:07:26 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f0d8cc902d0>
(VllmWorker rank=2 pid=445620) INFO 07-30 15:07:26 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_9feafae8'), local_subscribe_addr='ipc:///tmp/db508776-8d74-482b-8786-e142132362fe', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=445618) INFO 07-30 15:07:29 [utils.py:1077] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=445619) INFO 07-30 15:07:29 [utils.py:1077] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=445620) INFO 07-30 15:07:29 [utils.py:1077] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=445621) INFO 07-30 15:07:29 [utils.py:1077] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=445620) INFO 07-30 15:07:29 [pynccl.py:69] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=445619) INFO 07-30 15:07:29 [pynccl.py:69] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=445618) INFO 07-30 15:07:29 [pynccl.py:69] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=445621) INFO 07-30 15:07:29 [pynccl.py:69] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=445618) INFO 07-30 15:07:33 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_980f6fb3'), local_subscribe_addr='ipc:///tmp/30292a27-adf4-4e05-9328-0358f3051928', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=445618) INFO 07-30 15:07:33 [parallel_state.py:1064] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=2 pid=445620) INFO 07-30 15:07:33 [parallel_state.py:1064] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker rank=3 pid=445621) INFO 07-30 15:07:33 [parallel_state.py:1064] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker rank=1 pid=445619) INFO 07-30 15:07:33 [parallel_state.py:1064] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=445618) WARNING 07-30 15:07:33 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=2 pid=445620) WARNING 07-30 15:07:33 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=3 pid=445621) WARNING 07-30 15:07:33 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=445619) WARNING 07-30 15:07:33 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=445619) INFO 07-30 15:07:33 [gpu_model_runner.py:1531] Starting to load model /mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/...
(VllmWorker rank=3 pid=445621) INFO 07-30 15:07:33 [gpu_model_runner.py:1531] Starting to load model /mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/...
(VllmWorker rank=2 pid=445620) INFO 07-30 15:07:33 [gpu_model_runner.py:1531] Starting to load model /mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/...
(VllmWorker rank=0 pid=445618) INFO 07-30 15:07:33 [gpu_model_runner.py:1531] Starting to load model /mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/...
(VllmWorker rank=3 pid=445621) INFO 07-30 15:07:33 [cuda.py:217] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=445619) INFO 07-30 15:07:33 [cuda.py:217] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=445618) INFO 07-30 15:07:33 [cuda.py:217] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=445620) INFO 07-30 15:07:33 [cuda.py:217] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     self.worker.load_model()
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 164, in load_model
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     self.model_runner.load_model()
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1534, in load_model
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 58, in get_model
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     return loader.load_model(vllm_config=vllm_config,
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/model_loader/default_loader.py", line 273, in load_model
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     model = initialize_model(vllm_config=vllm_config,
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/model_loader/utils.py", line 61, in initialize_model
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/models/qwen3_moe.py", line 497, in __init__
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     self.model = Qwen3MoeModel(vllm_config=vllm_config,
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 151, in __init__
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/models/qwen3_moe.py", line 333, in __init__
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     self.start_layer, self.end_layer, self.layers = make_layers(
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                                                     ^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 625, in make_layers
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     [PPMissingLayer() for _ in range(start_layer)] + [
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                                                      ^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 626, in <listcomp>
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/models/qwen3_moe.py", line 335, in <lambda>
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     lambda prefix: Qwen3MoeDecoderLayer(config=config,
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/models/qwen3_moe.py", line 277, in __init__
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     self.mlp = Qwen3MoeSparseMoeBlock(config=config,
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/models/qwen3_moe.py", line 112, in __init__
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     self.experts = FusedMoE(num_experts=config.num_experts,
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 829, in __init__
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     self.quant_method.create_weights(layer=self, **moe_quant_params)
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 381, in create_weights
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     w13_weight = torch.nn.Parameter(torch.empty(
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]                                     ^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/torch/utils/_device.py", line 104, in __torch_function__
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]     return func(*args, **kwargs)
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=445621) ERROR 07-30 15:07:33 [multiproc_executor.py:487] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 3 has a total capacity of 95.00 GiB of which 600.06 MiB is free. Process 3617065 has 94.41 GiB memory in use. Of the allocated memory 93.14 GiB is allocated by PyTorch, and 3.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W730 15:07:35.990148347 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 07-30 15:07:36 [core.py:500] EngineCore failed to start.
ERROR 07-30 15:07:36 [core.py:500] Traceback (most recent call last):
ERROR 07-30 15:07:36 [core.py:500]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 491, in run_engine_core
ERROR 07-30 15:07:36 [core.py:500]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-30 15:07:36 [core.py:500]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 15:07:36 [core.py:500]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 07-30 15:07:36 [core.py:500]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-30 15:07:36 [core.py:500]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 07-30 15:07:36 [core.py:500]     self.model_executor = executor_class(vllm_config)
ERROR 07-30 15:07:36 [core.py:500]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 15:07:36 [core.py:500]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 07-30 15:07:36 [core.py:500]     self._init_executor()
ERROR 07-30 15:07:36 [core.py:500]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
ERROR 07-30 15:07:36 [core.py:500]     self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 07-30 15:07:36 [core.py:500]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 15:07:36 [core.py:500]   File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
ERROR 07-30 15:07:36 [core.py:500]     raise e from None
ERROR 07-30 15:07:36 [core.py:500] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 504, in run_engine_core
    raise e
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 491, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 390, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
    self.workers = WorkerProc.wait_for_ready(unready_workers)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
    raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
  File "/usr/local/lib/miniconda3/envs/SenseRL/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 56, in main
    args.dispatch_function(args)
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 42, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1324, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 153, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 185, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 157, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 123, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 734, in __init__
    super().__init__(
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 418, in __init__
    self._wait_for_engine_startup(output_address, parallel_config)
  File "/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 484, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/local/lib/miniconda3/envs/SenseRL/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
(SenseRL) root@5a80c4dad3b0:/# 

Your log shows that while starting Qwen3-235B-A22B on 4× H20 (95 GB), a WorkerProc crashed with a CUDA OOM: "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 3 has a total capacity of 95.00 GiB of which 600.06 MiB is free…". Two details in that message are worth noting. First, the 94.41 GiB in use on GPU 3 belongs to process 3617065, which is not one of this run's worker PIDs (445618–445621), so the card appears to be largely occupied by another (possibly stale) process before your model even starts loading. Second, the model genuinely exceeds this setup in fp16: 235B parameters × 2 bytes ≈ 470 GB for the weights alone, more than the 4 × 95 GB = 380 GB of total VRAM. The documentation is explicit that a model larger than the available single-GPU or combined VRAM will OOM, and that you then need to take steps to reduce memory consumption.
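
Before relaunching, it is worth confirming that nothing else is holding GPU memory. A minimal check using standard nvidia-smi query options:

# List every compute process currently holding GPU memory; before launching
# vllm serve, this should show only your own jobs (or nothing at all).
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# If a stale process is yours (e.g., the PID 3617065 from your log), stop it:
# kill <pid>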

Suggested mitigations (an adjusted example command follows this list):

1) Lower max_model_len (e.g., from 40960 to 8192 or less).
2) Reduce max_num_seqs (the maximum batch size).
3) Try a quantized model (e.g., int4/int8).
4) If supported (i.e., more GPUs are available), increase tensor_parallel_size, or use expert parallelism.
5) Add --gpu-memory-utilization 0.8 or lower to cap the fraction of VRAM vLLM claims.
6) If there is spare CPU RAM, offload some weights with --cpu-offload-gb.

For concrete parameters and methods, see the official memory-optimization docs and the common OOM troubleshooting discussions.
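
As a minimal sketch, here is your original command with items 1), 2) and 5) applied (these are standard vLLM flags; the values are illustrative, not tuned). Note that, per the arithmetic above, the fp16 weights alone still exceed 4× H20, so item 3), a quantized checkpoint, will ultimately be required:

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/ \
  --served-model-name Qwen3-235B-A22B \
  --port 6668 \
  --host 0.0.0.0 \
  --tensor-parallel-size 4 \
  --dtype half \
  --trust-remote-code \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.8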

Would you like more detailed troubleshooting steps and parameter suggestions?

What happens if max_model_len and max_num_seqs are not set?

If max_model_len is not set explicitly, vLLM infers the context length from the model config, which is usually the model's native maximum context window (40960 by default for Qwen3-235B-A22B). If max_num_seqs is not set, the default is generally 256 (some versions briefly raised it to 1024 before reverting to 256), i.e., the maximum number of sequences batched concurrently. Both behaviors are reflected in the documentation and the source code; a quick way to check the inferred context length is shown below.
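
As a sanity check, you can read the native context window straight from the checkpoint's config. A small sketch, assuming (as is typical for HF configs) that the relevant field is max_position_embeddings:

# Print the native context window recorded in the HF config; vLLM derives
# the default max_model_len from this (plus any rope-scaling settings).
grep '"max_position_embeddings"' \
  /mnt/afs/share_models/git_models/Qwen/Qwen3-235B-A22B/config.json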

This means: 1) the larger max_model_len is, the longer the usable context per request, but the higher the VRAM cost; 2) the larger max_num_seqs is, the more concurrent batching you get, but VRAM demand also grows significantly, making OOM more likely. The defaults suit most scenarios, but for very large models or limited VRAM it is advisable to lower both manually; official guidance is to reduce these two parameters first when hitting OOM. A rough per-token estimate below shows the scale involved.
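
A back-of-the-envelope KV-cache estimate illustrates why these knobs matter. This is a sketch only; the layer and head values below are assumptions for Qwen3-235B-A22B and should be verified against the model's config.json:

# Per-token KV-cache bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
layers=94; kv_heads=4; head_dim=128; dtype_bytes=2   # fp16; assumed values, verify in config.json
per_token=$(( 2 * layers * kv_heads * head_dim * dtype_bytes ))
echo "KV cache: ${per_token} bytes per token"        # ~188 KiB/token
# One full 40960-token sequence then needs roughly 7 GiB of KV cache
# (split across TP ranks), so lowering max_model_len / max_num_seqs
# directly shrinks the worst-case allocation.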

Would you like a detailed explanation of the auto-inference logic or of how to estimate VRAM usage?
