Can anyone help me? Why is this not working? It used to work 😭

FYI: NVIDIA T4 with driver version 570 and CUDA 12.8 installed on the system; the NVIDIA Container Toolkit is installed too.
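
A basic sanity check that the container toolkit actually exposes the GPU looks like this (the CUDA base image tag below is only an example; any locally available tag behaves the same). If the runtime is set up correctly, both commands should report the T4 and driver 570:

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi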

root:~# docker run -d --name vllmqwen --runtime=nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct --gpu-memory-utilization 0.95 --quantization bitsandbytes --dtype float16 --enforce-eager --max-model-len 2048
e807dc73ad5a3bd0a5a8285d64924a0cd699c8a3c759561b6de8ca5f7a6e406c
root:~# docker logs -f vllmqwen
INFO 05-07 22:30:56 [__init__.py:239] Automatically detected platform cuda.
INFO 05-07 22:31:00 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-07 22:31:00 [api_server.py:1044] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-14B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', max_model_len=2048, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.95, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization='bitsandbytes', rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
WARNING 05-07 22:31:01 [config.py:2972] Casting torch.bfloat16 to torch.float16.
INFO 05-07 22:31:10 [config.py:717] This model supports multiple tasks: {'reward', 'generate', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
WARNING 05-07 22:31:10 [config.py:830] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 05-07 22:31:11 [arg_utils.py:1658] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
WARNING 05-07 22:31:11 [cuda.py:93] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 05-07 22:31:12 [api_server.py:246] Started engine process with PID 48
INFO 05-07 22:31:16 [__init__.py:239] Automatically detected platform cuda.
INFO 05-07 22:31:18 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='Qwen/Qwen2.5-14B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-14B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-14B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
INFO 05-07 22:31:20 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-07 22:31:20 [cuda.py:289] Using XFormers backend.
INFO 05-07 22:31:22 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-07 22:31:22 [model_runner.py:1108] Starting to load model Qwen/Qwen2.5-14B-Instruct…
INFO 05-07 22:31:23 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while …
INFO 05-07 22:31:23 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 12% Completed | 1/8 [00:11<01:19, 11.39s/it]
Loading safetensors checkpoint shards: 25% Completed | 2/8 [00:22<01:07, 11.22s/it]
Loading safetensors checkpoint shards: 38% Completed | 3/8 [00:33<00:56, 11.31s/it]
Loading safetensors checkpoint shards: 50% Completed | 4/8 [00:45<00:45, 11.35s/it]
Loading safetensors checkpoint shards: 62% Completed | 5/8 [00:56<00:34, 11.36s/it]
Loading safetensors checkpoint shards: 75% Completed | 6/8 [01:01<00:18, 9.14s/it]
Loading safetensors checkpoint shards: 88% Completed | 7/8 [01:12<00:09, 9.89s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [01:24<00:00, 10.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [01:24<00:00, 10.54s/it]

INFO 05-07 22:32:49 [model_runner.py:1140] Model loading took 9.3407 GiB and 86.104178 seconds
INFO 05-07 22:33:11 [worker.py:287] Memory profiling takes 21.79 seconds
INFO 05-07 22:33:11 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.95) = 13.84GiB
INFO 05-07 22:33:11 [worker.py:287] model weights take 9.34GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 3.04GiB.
INFO 05-07 22:33:11 [executor_base.py:112] # cuda blocks: 1037, # CPU blocks: 1365
INFO 05-07 22:33:11 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 8.10x
ERROR 05-07 22:33:11 [engine.py:448] CUDA error: invalid argument
ERROR 05-07 22:33:11 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-07 22:33:11 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 05-07 22:33:11 [engine.py:448] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 05-07 22:33:11 [engine.py:448] Traceback (most recent call last):
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 05-07 22:33:11 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 05-07 22:33:11 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 05-07 22:33:11 [engine.py:448] return cls(
ERROR 05-07 22:33:11 [engine.py:448] ^^^^
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 05-07 22:33:11 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 05-07 22:33:11 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 278, in __init__
ERROR 05-07 22:33:11 [engine.py:448] self._initialize_kv_caches()
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 435, in _initialize_kv_caches
ERROR 05-07 22:33:11 [engine.py:448] self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 123, in initialize_cache
ERROR 05-07 22:33:11 [engine.py:448] self.collective_rpc("initialize_cache",
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-07 22:33:11 [engine.py:448] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-07 22:33:11 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
ERROR 05-07 22:33:11 [engine.py:448] return func(*args, **kwargs)
ERROR 05-07 22:33:11 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 327, in initialize_cache
ERROR 05-07 22:33:11 [engine.py:448] self._init_cache_engine()
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 333, in _init_cache_engine
ERROR 05-07 22:33:11 [engine.py:448] CacheEngine(self.cache_config, self.model_config,
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 66, in __init__
ERROR 05-07 22:33:11 [engine.py:448] self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
ERROR 05-07 22:33:11 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 22:33:11 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 95, in _allocate_kv_cache
ERROR 05-07 22:33:11 [engine.py:448] layer_kv_cache = torch.zeros(
ERROR 05-07 22:33:11 [engine.py:448] ^^^^^^^^^^^^
ERROR 05-07 22:33:11 [engine.py:448] RuntimeError: CUDA error: invalid argument
ERROR 05-07 22:33:11 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-07 22:33:11 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 05-07 22:33:11 [engine.py:448] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 05-07 22:33:11 [engine.py:448]
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 450, in run_mp_engine
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 278, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 435, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 123, in initialize_cache
self.collective_rpc("initialize_cache",
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 327, in initialize_cache
self._init_cache_engine()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 333, in _init_cache_engine
CacheEngine(self.cache_config, self.model_config,
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 66, in __init__
self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 95, in _allocate_kv_cache
layer_kv_cache = torch.zeros(
layer_kv_cache = torch.zeros(
^^^^^^^^^^^^
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank0]:[W507 22:33:12.616558800 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Distributed communication package - torch.distributed — PyTorch 2.7 documentation (function operator())
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Your RuntimeError: CUDA error: invalid argument raised from torch.zeros in cache_engine.py is a known issue when using vLLM with bitsandbytes quantization on Turing GPUs like the T4. The logs show vLLM correctly falling back from the V1 engine to V0 (Turing's compute capability 7.5 is below the 8.0 that V1 requires), but the error persists. Note that the failure happens after the model weights have already loaded, while the CPU (swap) KV cache is being allocated. This is most likely because bitsandbytes quantization is not fully optimized or supported for this hardware and/or the V0 engine, especially with a model as large as Qwen2.5-14B-Instruct on a 16 GB T4. The error is not caused by your CUDA or driver version, but by a kernel argument mismatch or unsupported operation in the quantized path for this hardware and model size.
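
Since the traceback points at the cpu_cache allocation in cache_engine.py (vLLM typically pins the CPU swap cache it allocates with torch.zeros), a quick diagnostic is to check whether pinned host allocation works at all inside the container. This is only a sketch, not an official troubleshooting step; it assumes the image's entrypoint can be overridden with --entrypoint and that a small 2 MiB tensor is representative:

docker run --rm --gpus all --ipc=host --entrypoint python3 vllm/vllm-openai:latest \
  -c "import torch; t = torch.zeros(1024, 1024, dtype=torch.float16, device='cpu', pin_memory=True); print('pinned host alloc OK:', t.is_pinned())"

If this one-liner also fails with "CUDA error: invalid argument", the problem is with pinned-memory allocation in your environment rather than the model; if it succeeds, the issue is more likely specific to the bitsandbytes path described above.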

To resolve this, try running without quantization (drop --quantization bitsandbytes, but keep --dtype float16, since the T4 does not support bfloat16) and use a model small enough to fit in 16 GB unquantized. If you must quantize, consider a different method such as AWQ or GPTQ, if a quantized build of the model exists and your hardware supports it, or try a different vLLM version. Also make sure your Docker image and vLLM are up to date, and that the container has enough shared memory (--shm-size=24g is recommended for large models). Would you like a step-by-step troubleshooting guide?
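
For example, a command along these lines would serve a pre-quantized AWQ build instead of quantizing with bitsandbytes at load time. This is a sketch, not a verified configuration: the model name Qwen/Qwen2.5-14B-Instruct-AWQ and the memory settings are assumptions you would need to check against what is actually published on Hugging Face and what fits on a 16 GB T4:

docker run -d --name vllmqwen --runtime=nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<your token>" \
  -p 8000:8000 --ipc=host --shm-size=24g \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --dtype float16 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager \
  --max-model-len 2048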
