I’m currently trying to install and run vLLM on a system with an NVIDIA RTX 5090 GPU and to load open-source models such as DeepSeek. However, I’ve run into errors during both configuration and runtime, and I haven’t been able to get it working properly.
I’ve already tried several solutions found online, including this GitHub issue:
I followed the modifications suggested in that thread, but the problem persists and errors still appear when I run the code.
To install and run vLLM on an NVIDIA RTX 5090, you must build vLLM from source in a container with CUDA 12.8 and a compatible PyTorch nightly (2.6+), because pre-built wheels and the official Docker images do not yet support Blackwell GPUs out of the box. The official instructions are to use the NGC PyTorch container, clone the vLLM repo, run python use_existing_torch.py, install the build requirements, and then build with python setup.py develop, optionally using ccache and MAX_JOBS to speed up the build. Flash Attention 3 is not yet supported on Blackwell, so set VLLM_FLASH_ATTN_VERSION=2 in your environment if you encounter attention-backend issues. See the full step-by-step instructions in this GitHub issue and the official docs.
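A rough command sketch of those steps is below. It assumes things not stated above: the NGC container tag, the name of the requirements file (it differs between vLLM checkouts), and the MAX_JOBS value, so adjust them for your setup.

```bash
# Sketch only: the container tag is an assumption; use a recent NGC PyTorch
# image that ships CUDA 12.8.
docker run --gpus all -it --rm -v "$PWD":/workspace \
    nvcr.io/nvidia/pytorch:25.02-py3

# --- inside the container ---
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Reuse the container's PyTorch instead of letting pip replace it with a
# wheel that lacks Blackwell (sm_120) support.
python use_existing_torch.py

# Build-time requirements; the path is requirements/build.txt on recent
# checkouts and requirements-build.txt on older ones.
pip install -r requirements/build.txt

# Compile the CUDA kernels in-place; ccache and MAX_JOBS shorten rebuilds.
MAX_JOBS=8 python setup.py develop

# Flash Attention 3 is not available on Blackwell yet; force FA2 if the
# attention backend errors out at startup.
export VLLM_FLASH_ATTN_VERSION=2
```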
If you still encounter errors after following these steps, common causes include mismatched CUDA/PyTorch versions, missing dependencies, or an outdated vLLM commit. Make sure you are building the latest vLLM source against the correct CUDA toolkit and a PyTorch build with CUDA 12.8 support. For multi-GPU or FP8 model support, additional workarounds or patches may be required, and full FP8 support is not yet stable on the 5090. For troubleshooting, see this section of the docs and consider the community-maintained Dockerfiles and guides referenced in the issue threads.
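As a quick sanity check after the build, something like the following confirms that the PyTorch you built against actually targets the 5090 and that inference starts. The model name is only an example of a smaller DeepSeek checkpoint that fits in 32 GB of VRAM, and port 8000 is vLLM's default.

```bash
# Verify the CUDA/PyTorch pairing used for the build; an RTX 5090 should
# report compute capability (12, 0) and a CUDA version of 12.8 or newer.
python - <<'PY'
import torch
print("torch  :", torch.__version__)
print("cuda   :", torch.version.cuda)
print("device :", torch.cuda.get_device_name(0))
print("cc     :", torch.cuda.get_device_capability(0))
PY

# Smoke test: serve a small DeepSeek distill (example model choice) on
# vLLM's default port 8000.
VLLM_FLASH_ATTN_VERSION=2 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --max-model-len 8192

# In a second shell, once the server reports it is ready:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "prompt": "Hello", "max_tokens": 32}'
```

If the capability check prints something other than (12, 0) with CUDA 12.8+, the build picked up the wrong PyTorch, which is the most common cause of the runtime errors described above.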