Why is this not working? I corrected the command, but it still fails with the same error.

FYI: NVIDIA driver 570, CUDA 12.5, and the NVIDIA Container Toolkit is installed as well.
root:~# docker run -d --name vllmqwen --runtime=nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct-AWQ --gpu-memory-utilization 0.95 --dtype float16 --enforce-eager --max-model-len 2048
2b6be09fe6f401fec0ebe2dde060a2caaac8edd0f6899c74a56a19e7a9229873
root:~# docker logs -f vllmqwen
INFO 05-07 23:00:43 [__init__.py:239] Automatically detected platform cuda.
INFO 05-07 23:00:48 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-07 23:00:48 [api_server.py:1044] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-14B-Instruct-AWQ', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', max_model_len=2048, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.95, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 05-07 23:00:59 [config.py:717] This model supports multiple tasks: {'generate', 'classify', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 05-07 23:01:00 [config.py:830] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 05-07 23:01:00 [arg_utils.py:1658] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
WARNING 05-07 23:01:00 [cuda.py:93] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 05-07 23:01:00 [api_server.py:246] Started engine process with PID 48
INFO 05-07 23:01:05 [__init__.py:239] Automatically detected platform cuda.
INFO 05-07 23:01:07 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='Qwen/Qwen2.5-14B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2.5-14B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-14B-Instruct-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
INFO 05-07 23:01:09 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-07 23:01:09 [cuda.py:289] Using XFormers backend.
INFO 05-07 23:01:11 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-07 23:01:11 [model_runner.py:1108] Starting to load model Qwen/Qwen2.5-14B-Instruct-AWQ…
INFO 05-07 23:01:12 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.40it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:02<00:01, 1.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:03<00:00, 1.38s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:03<00:00, 1.28s/it]

INFO 05-07 23:01:16 [loader.py:458] Loading weights took 3.97 seconds
INFO 05-07 23:01:16 [model_runner.py:1140] Model loading took 9.3722 GiB and 5.167720 seconds
INFO 05-07 23:01:19 [worker.py:287] Memory profiling takes 2.95 seconds
INFO 05-07 23:01:19 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.95) = 13.84GiB
INFO 05-07 23:01:19 [worker.py:287] model weights take 9.37GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 3.01GiB.
INFO 05-07 23:01:20 [executor_base.py:112] # cuda blocks: 1027, # CPU blocks: 1365
INFO 05-07 23:01:20 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 8.02x
ERROR 05-07 23:01:21 [engine.py:448] CUDA error: invalid argument
ERROR 05-07 23:01:21 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-07 23:01:21 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 05-07 23:01:21 [engine.py:448] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 05-07 23:01:21 [engine.py:448] Traceback (most recent call last):
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py”, line 436, in run_mp_engine
ERROR 05-07 23:01:21 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 05-07 23:01:21 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py”, line 128, in from_vllm_config
ERROR 05-07 23:01:21 [engine.py:448] return cls(
Process SpawnProcess-1:
ERROR 05-07 23:01:21 [engine.py:448] ^^^^
ERROR 05-07 23:01:21 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 05-07 23:01:21 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 05-07 23:01:21 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 23:01:21 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 278, in __init__
ERROR 05-07 23:01:21 [engine.py:448] self._initialize_kv_caches()
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py”, line 435, in _initialize_kv_caches
ERROR 05-07 23:01:21 [engine.py:448] self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py”, line 123, in initialize_cache
ERROR 05-07 23:01:21 [engine.py:448] self.collective_rpc(“initialize_cache”,
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py”, line 56, in collective_rpc
ERROR 05-07 23:01:21 [engine.py:448] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-07 23:01:21 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/utils.py”, line 2456, in run_method
ERROR 05-07 23:01:21 [engine.py:448] return func(*args, **kwargs)
ERROR 05-07 23:01:21 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py”, line 327, in initialize_cache
ERROR 05-07 23:01:21 [engine.py:448] self._init_cache_engine()
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py”, line 333, in _init_cache_engine
ERROR 05-07 23:01:21 [engine.py:448] CacheEngine(self.cache_config, self.model_config,
ERROR 05-07 23:01:21 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 66, in __init__
ERROR 05-07 23:01:21 [engine.py:448] self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, “cpu”)
ERROR 05-07 23:01:21 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 23:01:21 [engine.py:448] File “/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py”, line 95, in _allocate_kv_cache
ERROR 05-07 23:01:21 [engine.py:448] layer_kv_cache = torch.zeros(
ERROR 05-07 23:01:21 [engine.py:448] ^^^^^^^^^^^^
ERROR 05-07 23:01:21 [engine.py:448] RuntimeError: CUDA error: invalid argument
ERROR 05-07 23:01:21 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-07 23:01:21 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 05-07 23:01:21 [engine.py:448] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 05-07 23:01:21 [engine.py:448]
Traceback (most recent call last):
File “/usr/lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
self.run()
File “/usr/lib/python3.12/multiprocessing/process.py”, line 108, in run
self._target(*self._args, **self._kwargs)
File “/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py”, line 450, in run_mp_engine
raise e
File “/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py”, line 436, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py”, line 128, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 278, in __init__
self._initialize_kv_caches()
File “/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py”, line 435, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File “/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py”, line 123, in initialize_cache
self.collective_rpc(“initialize_cache”,
File “/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py”, line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/utils.py”, line 2456, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py”, line 327, in initialize_cache
self._init_cache_engine()
File “/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py”, line 333, in _init_cache_engine
CacheEngine(self.cache_config, self.model_config,
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 66, in __init__
self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, “cpu”)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py”, line 95, in _allocate_kv_cache
layer_kv_cache = torch.zeros(
^^^^^^^^^^^^
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank0]:[W507 23:01:21.158637882 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Distributed communication package - torch.distributed — PyTorch 2.7 documentation (function operator())
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File “/usr/lib/python3.12/asyncio/runners.py”, line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File “/usr/lib/python3.12/asyncio/runners.py”, line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “uvloop/loop.pyx”, line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 269, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Your error, RuntimeError: CUDA error: invalid argument, happens while the KV cache is being allocated, and it is most likely caused by an incompatibility between vLLM and your GPU's compute capability rather than by the command itself. The log shows "Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0." and "Cannot use FlashAttention-2 backend for Volta and Turing GPUs. Using XFormers backend", which means your GPU is pre-Ampere (compute capability below 8.0). On Volta/Turing hardware the V1 engine is unavailable and some quantization backends are only partially supported, and this known limitation can surface as CUDA errors during cache allocation or model execution.
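
If you want to confirm exactly which GPU and compute capability you have, you can check from the host (this assumes your nvidia-smi build supports the compute_cap query field, which recent drivers such as 570 should):

nvidia-smi --query-gpu=name,compute_cap --format=csv

or from inside any container with PyTorch installed:

python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"

For what it's worth, the 14.56 GiB of total GPU memory in your log looks consistent with a 16 GB Turing card such as a T4, which would be compute capability 7.5.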

To troubleshoot, try the following:

  • Ensure your GPU is Ampere (compute capability 8.0) or newer for full V1 engine support and well-optimized AWQ quantization.
  • If you are on Turing or Volta, you must use the V0 engine, and some quantization types (such as AWQ) may not be fully supported or optimized there, so CUDA errors like this one can still occur.
  • You can try reducing --max-model-len and --gpu-memory-utilization further, but this may not resolve the underlying hardware limitation.
  • For debugging, set the environment variable CUDA_LAUNCH_BLOCKING=1 so the error is reported at the exact failing CUDA call; an example command combining this with the previous point follows this list.
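
As a rough sketch of the last two suggestions combined, you could relaunch the container with a smaller context length, a slightly lower memory utilization, and synchronous CUDA error reporting. The reduced values below are illustrative rather than tuned for your card, and the Hugging Face token is left blank exactly as in your original command:

docker run -d --name vllmqwen --runtime=nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=" \
  -e CUDA_LAUNCH_BLOCKING=1 \
  -p 8000:8000 --ipc=host vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --dtype float16 --enforce-eager \
  --gpu-memory-utilization 0.90 --max-model-len 1024

Then check docker logs -f vllmqwen again; with CUDA_LAUNCH_BLOCKING=1 the traceback should point at the exact CUDA call that fails, which will help confirm whether the problem really is the KV-cache allocation seen above.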

Would you like a step-by-step breakdown or help identifying your GPU’s compute capability?
