I’m currently trying to install and run vLLM on a system with an NVIDIA RTX 5090 GPU and to load open-source models such as DeepSeek. However, I’ve run into errors during both configuration and runtime, and I haven’t been able to get it working properly.
I’ve already tried several solutions found online, including this GitHub issue:
I followed the modifications suggested in that thread, but the problem persists and errors still appear when I run the code.
To install and run vLLM on an NVIDIA RTX 5090, you must build vLLM from source in a container with CUDA 12.8 and a compatible PyTorch nightly (2.6+), because pre-built wheels and the official Docker images do not yet support Blackwell GPUs out of the box. The official instructions are to use the NGC PyTorch container, clone the vLLM repo, run python use_existing_torch.py, install the build requirements, and then build with python setup.py develop, optionally using ccache and MAX_JOBS to speed up the build. Flash Attention 3 is not yet supported on Blackwell, so set VLLM_FLASH_ATTN_VERSION=2 in your environment if you encounter attention-backend issues. See the full step-by-step instructions in this GitHub issue and the official docs.
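A rough command sketch of those steps is below. It assumes things not stated above: the NGC container tag, the name of the requirements file (it differs between vLLM checkouts), and the MAX_JOBS value, so adjust them for your setup.

```bash
# Sketch only: the container tag is an assumption; use a recent NGC PyTorch
# image that ships CUDA 12.8.
docker run --gpus all -it --rm -v "$PWD":/workspace \
    nvcr.io/nvidia/pytorch:25.02-py3

# --- inside the container ---
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Reuse the container's PyTorch instead of letting pip replace it with a
# wheel that lacks Blackwell (sm_120) support.
python use_existing_torch.py

# Build-time requirements; the path is requirements/build.txt on recent
# checkouts and requirements-build.txt on older ones.
pip install -r requirements/build.txt

# Compile the CUDA kernels in-place; ccache and MAX_JOBS shorten rebuilds.
MAX_JOBS=8 python setup.py develop

# Flash Attention 3 is not available on Blackwell yet; force FA2 if the
# attention backend errors out at startup.
export VLLM_FLASH_ATTN_VERSION=2
```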
If you still encounter errors after following these steps, common causes include mismatched CUDA/PyTorch versions, missing dependencies, or an outdated vLLM commit. Make sure you are building the latest vLLM source against the correct CUDA toolkit and a PyTorch build with CUDA 12.8 support. For multi-GPU or FP8 model support, additional workarounds or patches may be required, and full FP8 support is not yet stable on the 5090. For troubleshooting, see this section of the docs and consider the community-maintained Dockerfiles and guides referenced in the issue threads.
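As a quick sanity check after the build, something like the following confirms that the PyTorch you built against actually targets the 5090 and that inference starts. The model name is only an example of a smaller DeepSeek checkpoint that fits in 32 GB of VRAM, and port 8000 is vLLM's default.

```bash
# Verify the CUDA/PyTorch pairing used for the build; an RTX 5090 should
# report compute capability (12, 0) and a CUDA version of 12.8 or newer.
python - <<'PY'
import torch
print("torch  :", torch.__version__)
print("cuda   :", torch.version.cuda)
print("device :", torch.cuda.get_device_name(0))
print("cc     :", torch.cuda.get_device_capability(0))
PY

# Smoke test: serve a small DeepSeek distill (example model choice) on
# vLLM's default port 8000.
VLLM_FLASH_ATTN_VERSION=2 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --max-model-len 8192

# In a second shell, once the server reports it is ready:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "prompt": "Hello", "max_tokens": 32}'
```

If the capability check prints something other than (12, 0) with CUDA 12.8+, the build picked up the wrong PyTorch, which is the most common cause of the runtime errors described above.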