@RunLLM The Qwen/Qwen2.5-7B-Instruct-AWQ server started:
(APIServer pid=591196) INFO: Started server process [591196]
(APIServer pid=591196) INFO: Waiting for application startup.
(APIServer pid=591196) INFO: Application startup complete.
Thereafter, no command prompt was returned.
I had to open a new terminal, change into the project directory, run source .venv/bin/activate, and then run the Python sample script you had provided. However, it failed. How do I resolve the errors?
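Concretely, the commands I ran in the new terminal were the following (the project path is the one that appears in the traceback below):

```shell
# In a fresh terminal, while the server kept running in the first one:
cd /home/vllm/project1         # project directory, as shown in the traceback
source .venv/bin/activate      # activate the project's virtual environment
python main.py                 # run the sample script, which then failed
```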
$ python main.py
INFO 03-03 12:24:37 [utils.py:229] non-default args: {'disable_log_stats': True, 'quantization': 'awq', 'model': 'Qwen/Qwen2.5-7B-Instruct-AWQ'}
INFO 03-03 12:24:39 [model.py:530] Resolved architecture: Qwen2ForCausalLM
INFO 03-03 12:24:39 [model.py:1553] Using max model len 32768
INFO 03-03 12:24:40 [awq_marlin.py:166] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference
INFO 03-03 12:24:40 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
Parse safetensors files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6.56it/s]
INFO 03-03 12:24:41 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=603265) INFO 03-03 12:24:43 [core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev158+g168ee03e1) with config: model='Qwen/Qwen2.5-7B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct-AWQ, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': , 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': }
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] super().__init__(
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ~~~~~~~~~~~~~~~~^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] vllm_config,
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ...<3 lines>...
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] internal_dp_balancing,
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] )
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] self._init_executor()
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] self.driver_worker.init_device()
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_worker.py", line 256, in init_device
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] current_platform.set_device(self.device)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/platforms/cuda.py", line 145, in set_device
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] _ = torch.zeros(1, device=device)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] torch.AcceleratorError: CUDA error: out of memory
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]
(EngineCore_DP0 pid=603265) Process EngineCore_DP0:
(EngineCore_DP0 pid=603265) Traceback (most recent call last):
(EngineCore_DP0 pid=603265) File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
(EngineCore_DP0 pid=603265) self.run()
(EngineCore_DP0 pid=603265) ~~~~~~~~^^
(EngineCore_DP0 pid=603265) File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=603265) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=603265) ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=603265) raise e
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=603265) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=603265) super().__init__(
(EngineCore_DP0 pid=603265) ~~~~~~~~~~~~~~~~^
(EngineCore_DP0 pid=603265) vllm_config,
(EngineCore_DP0 pid=603265) ^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ...<3 lines>...
(EngineCore_DP0 pid=603265) internal_dp_balancing,
(EngineCore_DP0 pid=603265) ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) )
(EngineCore_DP0 pid=603265) ^
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=603265) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=603265) ~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=603265) self._init_executor()
(EngineCore_DP0 pid=603265) ~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=603265) self.driver_worker.init_device()
(EngineCore_DP0 pid=603265) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=603265) self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=603265) ~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_worker.py", line 256, in init_device
(EngineCore_DP0 pid=603265) current_platform.set_device(self.device)
(EngineCore_DP0 pid=603265) ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/platforms/cuda.py", line 145, in set_device
(EngineCore_DP0 pid=603265) _ = torch.zeros(1, device=device)
(EngineCore_DP0 pid=603265) torch.AcceleratorError: CUDA error: out of memory
(EngineCore_DP0 pid=603265) Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=603265) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=603265) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=603265) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=603265)
Traceback (most recent call last):
File "/home/vllm/project1/main.py", line 4, in <module>
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-AWQ", quantization="awq")
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/entrypoints/llm.py", line 373, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^
engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/llm_engine.py", line 177, in from_engine_args
return cls(
vllm_config=vllm_config,
...<4 lines>...
multiprocess_mode=enable_multiprocessing,
)
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/llm_engine.py", line 111, in __init__
self.engine_core = EngineCoreClient.make_client(
^
multiprocess_mode=multiprocess_mode,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<3 lines>...
log_stats=self.log_stats,
^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 98, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
return func(*args, **kwargs)
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 729, in __init__
super().__init__(
^
asyncio_mode=False,
^^^^^^^^^^^^^^^^^^^
...<2 lines>...
log_stats=log_stats,
^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 569, in __init__
with launch_core_engines(
~^
vllm_config,
^^^^^^^^^^^^
...<2 lines>...
addresses,
^^^^^^^^^^
) as (engine_manager, coordinator, addresses):
^
File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 148, in __exit__
next(self.gen)
^^^^^^^^^^
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
wait_for_engine_startup(
~~~~~~~~~~~~~~~~~~~~~~~^
handshake_socket,
^^^^^^^^^^^^^^^^^
...<6 lines>...
coordinator.proc if coordinator else None,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
raise RuntimeError(
...<3 lines>...
)
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
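For reference, here is what main.py contains as far as I can reconstruct it. Only the LLM(...) call on line 4 of main.py is confirmed by the traceback above; the import and the generate call are my assumption about the rest of the sample script you provided:

```python
# Reconstructed main.py; only the LLM(...) line is confirmed by the traceback.
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-AWQ", quantization="awq")  # fails here with CUDA OOM
outputs = llm.generate(["Hello, my name is"])  # assumed usage; never reached
print(outputs[0].outputs[0].text)
```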