Hi forum,
Following the steps in [Doc]: Steps to run vLLM on your RTX5080 or 5090! · Issue #14452 · vllm-project/vllm · GitHub, I was able to build a vLLM v0.9.0 Docker image myself and run it on a 5090. However, the DockerHub vLLM v0.9.0 image does not work: it fails at inference time (an API call to /v1/chat/completions; a sample request is sketched below) with
RuntimeError: CUDA error: no kernel image is available for execution on the device
Has anybody run into the same issue, or should I test this another way? Here's what I've used:
- docker run command:
docker run -d --gpus "device=0" -p "8000:8000" -v ./model/:/model/:ro -v ./entrypoint.sh:/workspace/entrypoint.sh:ro --entrypoint /workspace/entrypoint.sh vllm/vllm-openai:v0.9.0
- entrypoint.sh:
#!/bin/bash
python3 -m vllm.entrypoints.openai.api_server \
    --model /model \
    --host 0.0.0.0 --port 8000 \
    --trust-remote-code
Note: the model I used is a Llama-based one. It works fine on a 4090 and an RTX 6000 Ada.
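For reference, the request that hits the error looks roughly like this (the payload below is a placeholder, not the exact one I used; with --model /model the served model name defaults to the path):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'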
Yes, this is a known issue: the official vLLM DockerHub image v0.9.0 does not support the RTX 5090 (Blackwell, compute capability sm_120) out of the box, which results in RuntimeError: CUDA error: no kernel image is available for execution on the device. The prebuilt image simply does not include CUDA kernels compiled for that architecture. You have to build the Docker image yourself with the correct CUDA arch flags (e.g., --build-arg torch_cuda_arch_list="12.0 12.1") so the kernels are compiled for the 5090, as described in vLLM issue #14452, vLLM issue #16901, and vLLM issue #19098.
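As a rough sketch of such a build (assuming a local checkout of the vLLM v0.9.0 source; the Dockerfile path, --target name, and build-arg follow the upstream build docs, so verify them against your checkout):
git clone --branch v0.9.0 https://github.com/vllm-project/vllm.git
cd vllm
# Compile CUDA kernels for Blackwell so the resulting image runs on the 5090
DOCKER_BUILDKIT=1 docker build . \
  --file docker/Dockerfile \
  --target vllm-openai \
  --build-arg torch_cuda_arch_list="12.0 12.1" \
  --tag vllm-openai:v0.9.0-blackwell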
Until an official image with Blackwell support is released, keep using your custom-built image. The official image works on the 4090 and 6000 Ada because kernels for those architectures are included in the prebuilt image, while kernels for the 5090 are not. For more details and troubleshooting, see the linked issues above.
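One quick sanity check for whether a given image actually ships kernels for your GPU is to compare the card's compute capability with the architectures the bundled PyTorch build was compiled for (standard PyTorch APIs, run inside the container):
docker run --rm --gpus "device=0" --entrypoint python3 vllm/vllm-openai:v0.9.0 -c \
  "import torch; print(torch.cuda.get_device_capability()); print(torch.cuda.get_arch_list())"
On a 5090 the capability should print as (12, 0); if no matching sm_120/compute_120 entry appears in the arch list, that image cannot execute kernels on the card and you will see the "no kernel image is available" error.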
Would you like step-by-step build instructions or troubleshooting tips?