Running inference with Qwen3-1.7B on an RTX 3090 (24 GiB), I frequently hit this error. Full log below:
/root/miniconda3/envs/eval/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
INFO 02-05 11:42:21 [utils.py:261] non-default args: {'disable_log_stats': True, 'model': 'models/Qwen/Qwen3-1.7B'}
INFO 02-05 11:42:21 [model.py:541] Resolved architecture: Qwen3ForCausalLM
INFO 02-05 11:42:21 [model.py:1561] Using max model len 40960
2026-02-05 11:42:21,182 INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
INFO 02-05 11:42:21 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 02-05 11:42:21 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:21 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='models/Qwen/Qwen3-1.7B', speculative_config=None, tokenizer='models/Qwen/Qwen3-1.7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=models/Qwen/Qwen3-1.7B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 
'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:24 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.7:48773 backend=nccl
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:24 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:25 [gpu_model_runner.py:4033] Starting to load model models/Qwen/Qwen3-1.7B...
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:26 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.99it/s]
(EngineCore_DP0 pid=4537)
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:27 [default_loader.py:291] Loading weights took 1.13 seconds
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:28 [gpu_model_runner.py:4130] Model loading took 3.22 GiB memory and 1.997281 seconds
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:37 [backends.py:812] Using cache directory: /root/.cache/vllm/torch_compile_cache/02e558d41a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:37 [backends.py:872] Dynamo bytecode transform time: 8.88 s
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:43 [backends.py:267] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 0.919 s
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:43 [monitor.py:34] torch.compile takes 9.80 s in total
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:44 [gpu_worker.py:356] Available KV cache memory: 16.65 GiB
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:44 [kv_cache_utils.py:1307] GPU KV cache size: 155,904 tokens
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:44 [kv_cache_utils.py:1312] Maximum concurrency for 40,960 tokens per request: 3.81x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:02<00:00, 19.84it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 21.82it/s]
(EngineCore_DP0 pid=4537) INFO 02-05 11:42:49 [gpu_model_runner.py:5063] Graph capturing finished in 5 secs, took 1.99 GiB
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] Traceback (most recent call last):
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4812, in _dummy_sampler_run
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] sampler_output = self.sampler(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/sample/sampler.py", line 96, in forward
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] sampled, processed_logprobs = self.sample(logits, sampling_metadata)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/sample/sampler.py", line 187, in sample
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] random_sampled, processed_logprobs = self.topk_topp_sampler(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 104, in forward_native
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] logits = self.apply_top_k_top_p(logits, k, p)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 262, in apply_top_k_top_p
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 446.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 422.81 MiB is free. Process 68196 has 23.27 GiB memory in use. Of the allocated memory 20.93 GiB is allocated by PyTorch, with 38.00 MiB allocated in private pools (e.g., CUDA Graphs), and 85.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946]
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946]
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] Traceback (most recent call last):
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] super().__init__(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 269, in _initialize_kv_caches
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 116, in initialize_from_config
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return func(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 530, in compile_or_warm_up_model
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] self.model_runner._dummy_sampler_run(hidden_states=last_hidden_states)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] return func(*args, **kwargs)
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] File "/root/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4817, in _dummy_sampler_run
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] raise RuntimeError(
(EngineCore_DP0 pid=4537) ERROR 02-05 11:42:49 [core.py:946] RuntimeError: CUDA out of memory occurred when warming up sampler with 256 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
(EngineCore_DP0 pid=4537) Process EngineCore_DP0:
(EngineCore_DP0 pid=4537) [... the engine core process then re-raises the same traceback as above ...]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 6
3 from transformers import AutoTokenizer
5 # Initialize the model (tensor_parallel_size, GPU count, etc. can be specified here)
----> 6 llm = LLM(model="models/Qwen/Qwen3-1.7B")  # or a local path
7 tokenizer = AutoTokenizer.from_pretrained("models/Qwen/Qwen3-1.7B")
9 # Define sampling parameters
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/entrypoints/llm.py:334, in LLM.__init__(self, model, runner, convert, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, allowed_media_domains, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, enable_return_routed_experts, disable_custom_all_reduce, hf_token, hf_overrides, mm_processor_kwargs, pooler_config, structured_outputs_config, profiler_config, attention_config, kv_cache_memory_bytes, compilation_config, logits_processors, **kwargs)
297 engine_args = EngineArgs(
298 model=model,
299 runner=runner,
(...) 329 **kwargs,
330 )
332 log_non_default_args(engine_args)
--> 334 self.llm_engine = LLMEngine.from_engine_args(
335 engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
336 )
337 self.engine_class = type(self.llm_engine)
339 self.request_counter = Counter()
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py:172, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers, enable_multiprocessing)
169 enable_multiprocessing = True
171 # Create the LLMEngine.
--> 172 return cls(
173 vllm_config=vllm_config,
174 executor_class=executor_class,
175 log_stats=not engine_args.disable_log_stats,
176 usage_context=usage_context,
177 stat_loggers=stat_loggers,
178 multiprocess_mode=enable_multiprocessing,
179 )
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py:106, in LLMEngine.__init__(self, vllm_config, executor_class, log_stats, aggregate_engine_logging, usage_context, stat_loggers, mm_registry, use_cached_outputs, multiprocess_mode)
103 self.output_processor.tracer = tracer
105 # EngineCore (gets EngineCoreRequests and gives EngineCoreOutputs)
--> 106 self.engine_core = EngineCoreClient.make_client(
107 multiprocess_mode=multiprocess_mode,
108 asyncio_mode=False,
109 vllm_config=vllm_config,
110 executor_class=executor_class,
111 log_stats=self.log_stats,
112 )
114 self.logger_manager: StatLoggerManager | None = None
115 if self.log_stats:
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:94, in EngineCoreClient.make_client(multiprocess_mode, asyncio_mode, vllm_config, executor_class, log_stats)
89 return EngineCoreClient.make_async_mp_client(
90 vllm_config, executor_class, log_stats
91 )
93 if multiprocess_mode and not asyncio_mode:
---> 94 return SyncMPClient(vllm_config, executor_class, log_stats)
96 return InprocClient(vllm_config, executor_class, log_stats)
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:647, in SyncMPClient.__init__(self, vllm_config, executor_class, log_stats)
644 def __init__(
645 self, vllm_config: VllmConfig, executor_class: type[Executor], log_stats: bool
646 ):
--> 647 super().__init__(
648 asyncio_mode=False,
649 vllm_config=vllm_config,
650 executor_class=executor_class,
651 log_stats=log_stats,
652 )
654 self.is_dp = self.vllm_config.parallel_config.data_parallel_size > 1
655 self.outputs_queue = queue.Queue[EngineCoreOutputs | Exception]()
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/core_client.py:479, in MPClient.__init__(self, asyncio_mode, vllm_config, executor_class, log_stats, client_addresses)
476 self.stats_update_address = client_addresses.get("stats_update_address")
477 else:
478 # Engines are managed by this client.
--> 479 with launch_core_engines(vllm_config, executor_class, log_stats) as (
480 engine_manager,
481 coordinator,
482 addresses,
483 ):
484 self.resources.coordinator = coordinator
485 self.resources.engine_manager = engine_manager
File ~/miniconda3/envs/eval/lib/python3.12/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
142 if typ is None:
143 try:
--> 144 next(self.gen)
145 except StopIteration:
146 return False
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/utils.py:933, in launch_core_engines(vllm_config, executor_class, log_stats, num_api_servers)
930 yield local_engine_manager, coordinator, addresses
932 # Now wait for engines to start.
--> 933 wait_for_engine_startup(
934 handshake_socket,
935 addresses,
936 engines_to_handshake,
937 parallel_config,
938 dp_size > 1 and vllm_config.model_config.is_moe,
939 vllm_config.cache_config,
940 local_engine_manager,
941 coordinator.proc if coordinator else None,
942 )
File ~/miniconda3/envs/eval/lib/python3.12/site-packages/vllm/v1/engine/utils.py:992, in wait_for_engine_startup(handshake_socket, addresses, core_engines, parallel_config, coordinated_dp, cache_config, proc_manager, coord_process)
990 if coord_process is not None and coord_process.exitcode is not None:
991 finished[coord_process.name] = coord_process.exitcode
--> 992 raise RuntimeError(
993 "Engine core initialization failed. "
994 "See root cause above. "
995 f"Failed core proc(s): {finished}"
996 )
998 # Receive HELLO and READY messages from the input socket.
999 eng_identity, ready_msg_bytes = handshake_socket.recv_multipart()
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
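The root-cause RuntimeError in the log itself suggests the fix: the sampler warm-up runs 256 dummy requests, and only ~423 MiB was free on the 24 GiB card after the model (3.22 GiB), KV cache (16.65 GiB), and CUDA graphs (1.99 GiB) were allocated. A minimal sketch of an adjusted initialization, assuming the same local model path; the specific values (`gpu_memory_utilization=0.85`, `max_num_seqs=64`) are illustrative starting points I have not benchmarked, not tested recommendations:

```python
import os

# The OOM message also suggests this allocator setting to reduce
# fragmentation; it must be set before torch/vllm touch the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

from vllm import LLM

llm = LLM(
    model="models/Qwen/Qwen3-1.7B",
    # Reserve less of the GPU for vLLM so the sampler warm-up has headroom
    # (default is 0.9; 0.85 is an assumed starting point for a 24 GiB card).
    gpu_memory_utilization=0.85,
    # Fewer concurrent sequences means fewer dummy requests during the
    # sampler warm-up that OOMed above (default 256; 64 is an assumption).
    max_num_seqs=64,
)
```

Lowering `max_num_seqs` directly shrinks the logits tensor that `logits.sort(...)` allocated 446 MiB for; lowering `gpu_memory_utilization` instead shrinks the KV cache, trading off the reported 155,904-token cache size.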