Running inference with Qwen3-1.7B on an RTX 3090 (24 GiB), I frequently hit this error. Full log below:
/root/miniconda3/envs/eval/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
INFO 02-05 11:42:21 [utils.py:261] non-default args: {'disable_log_stats': True, 'model': 'models/Qwen/Qwen3-1.7B'}
INFO 02-05 11:42:21 [model.py:541] Resolved architecture: Qwen3ForCausalLM
INFO 02-05 11:42:21 [model.py:1561] Using max model len 40960
2026-02-05 11:42:21,182 INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
INFO 02-05 11:42:21 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 02-05 11:42:21 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:21 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='models/Qwen/Qwen3-1.7B', speculative_config=None, tokenizer='models/Qwen/Qwen3-1.7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=models/Qwen/Qwen3-1.7B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 
'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:24 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.7:48773 backend=nccl
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:24 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:25 [gpu_model_runner.py:4033] Starting to load model models/Qwen/Qwen3-1.7B...
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:26 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.99it/s]
(EngineCore_DP0 pid=4537)
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:27 [default_loader.py:291] Loading weights took 1.13 seconds
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:28 [gpu_model_runner.py:4130] Model loading took 3.22 GiB memory and 1.997281 seconds
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:37 [backends.py:812] Using cache directory: /root/.cache/vllm/torch_compile_cache/02e558d41a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:37 [backends.py:872] Dynamo bytecode transform time: 8.88 s
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:43 [backends.py:267] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 0.919 s
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:43 [monitor.py:34] torch.compile takes 9.80 s in total
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:44 [gpu_worker.py:356] Available KV cache memory: 16.65 GiB
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:44 [kv_cache_utils.py:1307] GPU KV cache size: 155,904 tokens
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:44 [kv_cache_utils.py:1312] Maximum concurrency for 40,960 tokens per request: 3.81x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:02<00:00, 19.84it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 21.82it/s]
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:49 [gpu_model_runner.py:5063] Graph capturing finished in 5 secs, took 1.99 GiB
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] Traceback (most recent call last):
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4812, in _dummy_sampler_run
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] sampler_output = self.sampler(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/sample/sampler.py", line 96, in forward
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] sampled, processed_logprobs = self.sample(logits, sampling_metadata)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/sample/sampler.py", line 187, in sample
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] random_sampled, processed_logprobs = self.topk_topp_sampler(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 104, in forward_native
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] logits = self.apply_top_k_top_p(logits, k, p)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 262, in apply_top_k_top_p
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 446.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 422.81 MiB is free. Process 68196 has 23.27 GiB memory in use. Of the allocated memory 20.93 GiB is allocated by PyTorch, with 38.00 MiB allocated in private pools (e.g., CUDA Graphs), and 85.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946]
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946]
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] Traceback (most recent call last):
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] super().__init__(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 269, in _initialize_kv_caches
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 116, in initialize_from_config
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return func(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 530, in compile_or_warm_up_model
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] self.model_runner._dummy_sampler_run(hidden_states=last_hidden_states)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return func(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4817, in _dummy_sampler_run
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] raise RuntimeError(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
(EngineCore_DP0 pid=4537) Process EngineCore_DP0:
(EngineCore_DP0 pid=4537) [... the engine core process then re-raises the same traceback as above ...]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 6
3 from transformers import AutoTokenizer
5 # Initialize the model (tensor_parallel_size, GPU count, etc. can be specified here)
----> 6 llm = LLM(model="models/Qwen/Qwen3-1.7B")  # or a local path
7 tokenizer = AutoTokenizer.from_pretrained("models/Qwen/Qwen3-1.7B")
9 # Define sampling parameters
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/entrypoints/llm.py:334, in LLM.__init__(self, model, runner, convert, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, allowed_media_domains, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, enable_return_routed_experts, disable_custom_all_reduce, hf_token, hf_overrides, mm_processor_kwargs, pooler_config, structured_outputs_config, profiler_config, attention_config, kv_cache_memory_bytes, compilation_config, logits_processors, **kwargs)
297 engine_args = EngineArgs(
298 model=model,
299 runner=runner,
(...) 329 **kwargs,
330 )
332 log_non_default_args(engine_args)
--> 334 self.llm_engine = LLMEngine.from_engine_args(
335 engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
336 )
337 self.engine_class = type(self.llm_engine)
339 self.request_counter = Counter()
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py:172, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers, enable_multiprocessing)
169 enable_multiprocessing = True
171 # Create the LLMEngine.
--> 172 return cls(
173 vllm_config=vllm_config,
174 executor_class=executor_class,
175 log_stats=not engine_args.disable_log_stats,
176 usage_context=usage_context,
177 stat_loggers=stat_loggers,
178 multiprocess_mode=enable_multiprocessing,
179 )
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py:106, in LLMEngine.__init__(self, vllm_config, executor_class, log_stats, aggregate_engine_logging, usage_context, stat_loggers, mm_registry, use_cached_outputs, multiprocess_mode)
103 self.output_processor.tracer = tracer
105 # EngineCore (gets EngineCoreRequests and gives EngineCoreOutputs)
--> 106 self.engine_core = EngineCoreClient.make_client(
107 multiprocess_mode=multiprocess_mode,
108 asyncio_mode=False,
109 vllm_config=vllm_config,
110 executor_class=executor_class,
111 log_stats=self.log_stats,
112 )
114 self.logger_manager: StatLoggerManager | None = None
115 if self.log_stats:
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:94, in EngineCoreClient.make_client(multiprocess_mode, asyncio_mode, vllm_config, executor_class, log_stats)
89 return EngineCoreClient.make_async_mp_client(
90 vllm_config, executor_class, log_stats
91 )
93 if multiprocess_mode and not asyncio_mode:
---> 94 return SyncMPClient(vllm_config, executor_class, log_stats)
96 return InprocClient(vllm_config, executor_class, log_stats)
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:647, in SyncMPClient.__init__(self, vllm_config, executor_class, log_stats)
644 def __init__(
645 self, vllm_config: VllmConfig, executor_class: type[Executor], log_stats: bool
646 ):
--> 647 super().__init__(
648 asyncio_mode=False,
649 vllm_config=vllm_config,
650 executor_class=executor_class,
651 log_stats=log_stats,
652 )
654 self.is_dp = self.vllm_config.parallel_config.data_parallel_size > 1
655 self.outputs_queue = queue.Queue[EngineCoreOutputs | Exception]()
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:479, in MPClient.__init__(self, asyncio_mode, vllm_config, executor_class, log_stats, client_addresses)
476 self.stats_update_address = client_addresses.get("stats_update_address")
477 else:
478 # Engines are managed by this client.
--> 479 with launch_core_engines(vllm_config, executor_class, log_stats) as (
480 engine_manager,
481 coordinator,
482 addresses,
483 ):
484 self.resources.coordinator = coordinator
485 self.resources.engine_manager = engine_manager
File ~/miniconda3/envs/eval/lib/python3.12/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
142 if typ is None:
143 try:
--> 144 next(self.gen)
145 except StopIteration:
146 return False
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/utils.py:933, in launch_core_engines(vllm_config, executor_class, log_stats, num_api_servers)
930 yield local_engine_manager, coordinator, addresses
932 # Now wait for engines to start.
--> 933 wait_for_engine_startup(
934 handshake_socket,
935 addresses,
936 engines_to_handshake,
937 parallel_config,
938 dp_size > 1 and vllm_config.model_config.is_moe,
939 vllm_config.cache_config,
940 local_engine_manager,
941 coordinator.proc if coordinator else None,
942 )
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/utils.py:992, in wait_for_engine_startup(handshake_socket, addresses, core_engines, parallel_config, coordinated_dp, cache_config, proc_manager, coord_process)
990 if coord_process is not None and coord_process.exitcode is not None:
991 finished[coord_process.name] = coord_process.exitcode
--> 992 raise RuntimeError(
993 "Engine core initialization failed. "
994 "See root cause above. "
995 f"Failed core proc(s): {finished}"
996 )
998 # Receive HELLO and READY messages from the input socket.
999 eng_identity, ready_msg_bytes = handshake_socket.recv_multipart()
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
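The root-cause RuntimeError in the log itself suggests the fix: the sampler warm-up runs 256 dummy requests, and only ~423 MiB was free on the 24 GiB card after the model (3.22 GiB), KV cache (16.65 GiB), and CUDA graphs (1.99 GiB) were allocated. A minimal sketch of an adjusted initialization, assuming the same local model path; the specific values (`gpu_memory_utilization=0.85`, `max_num_seqs=64`) are illustrative starting points I have not benchmarked, not tested recommendations:

```python
import os

# The OOM message also suggests this allocator setting to reduce
# fragmentation; it must be set before torch/vllm touch the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

from vllm import LLM

llm = LLM(
    model="models/Qwen/Qwen3-1.7B",
    # Reserve less of the GPU for vLLM so the sampler warm-up has headroom
    # (default is 0.9; 0.85 is an assumed starting point for a 24 GiB card).
    gpu_memory_utilization=0.85,
    # Fewer concurrent sequences means fewer dummy requests during the
    # sampler warm-up that OOMed above (default 256; 64 is an assumption).
    max_num_seqs=64,
)
```

Lowering `max_num_seqs` directly shrinks the logits tensor that `logits.sort(...)` allocated 446 MiB for; lowering `gpu_memory_utilization` instead shrinks the KV cache, trading off the reported 155,904-token cache size.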