vLLM on RTX 5090: Working GPU setup with torch 2.9.0 cu128

vLLM on RTX 5090 Blackwell: A Technical Deep Dive

Hi folks, I spent most of today trying to get vLLM running with PyTorch 2.9.0 and it looks like the most recent build takes care of a lot of errors.

There are so many ways to get this wrong and I’m amazed it worked at all. I think I hit every issue on this forum on the way to this point. I hope this helps anyone else working through the same issues get things running.

Working Installation Process

System Configuration

  • GPU: NVIDIA GeForce RTX 5090 GB202 (Blackwell sm_120 architecture)
  • Driver: NVIDIA 575.64.03
  • CUDA: 12.8 (automatically supported by driver)
  • OS: Ubuntu 25.04 Plucky Puffin
  • RAM: 192 GB
  • CPU: AMD Ryzen 9 9900X3D (12-core)
  • Python: 3.12.10 (in virtual environment)
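
If you are on a similar rig and want to sanity-check it before starting, something like the following should report the driver, compute capability 12.0, and (once torch is installed) a cu128 build. Note that compute_cap as an nvidia-smi query field assumes a reasonably recent driver:

# GPU, driver, and compute capability as the driver sees them
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
# what PyTorch was built against and what it detects (run inside the venv created below)
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability())"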

The Working Solution

The successful installation used PyTorch 2.9 nightly + vLLM source build:

vllm --version
# Output: 0.10.1rc2.dev413+g5438967fb.d20250901

This is a September 1, 2025 development build compiled from source with PyTorch 2.9.0.dev20250831+cu128.

Installation Steps (Confirmed Working Method)

  1. Environment Setup
uv venv --python 3.12.10 --seed
source .venv/bin/activate
  2. PyTorch 2.9 Nightly Installation
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
# Result: torch==2.9.0.dev20250831+cu128
  3. vLLM Source Build
gh repo clone vllm-project/vllm    # This can just as easily be cloning from ssh
cd vllm
python use_existing_torch.py     # Important: strips the pinned torch requirements so the build uses the installed nightly
pip install -r requirements/build.txt
export VLLM_FLASH_ATTN_VERSION=2
export TORCH_CUDA_ARCH_LIST="12.0"
MAX_JOBS=6 pip install --no-build-isolation -e .
  4. Verification
vllm --version
# Should show: 0.10.1rc2.dev413+g5438967fb.d20250901
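
As an optional last step, a minimal smoke test against the OpenAI-compatible server; the model name is just the Qwen2.5-7B instruct checkpoint discussed later, and port 8000 is the vllm serve default:

# terminal 1: serve a small model
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 8192
# terminal 2: one completion request against the default port
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Hello", "max_tokens": 32}'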

What Doesn’t Work (Confirmed Failures)

Pre-built Wheels from PyPI

  • All official releases: Lack Blackwell sm_120 support
  • Dependency conflicts: Version mismatches with PyTorch requirements
  • Missing wheel combinations: No PyTorch 2.8.x+cu128 wheels exist

Docker Container Approaches

  • All official images: Not ready for Blackwell architecture
  • NVIDIA NGC containers: Contain outdated vLLM versions incompatible with RTX 5090

Failed PyTorch Version Attempts

  • PyTorch 2.7.x: Only supports CUDA 12.6, incompatible with Blackwell sm_120
  • PyTorch 2.8.x: No cu128 wheels available anywhere (404 errors on GitHub releases)
  • Dependency resolution conflicts: vLLM nightly wants torch==2.8.0, but this doesn’t exist with cu128

Environment Variable Requirements

The successful build required specific Blackwell-targeting variables:

VLLM_FLASH_ATTN_VERSION=2  # FA3 unsupported on Blackwell
TORCH_CUDA_ARCH_LIST="12.0"  # Blackwell sm_120 architecture
MAX_JOBS=6  # Memory management during compilation

Performance Analysis

Startup and Warmup Behavior

Initial load: ~72 seconds (model download and compilation)
First request: 19.9 tokens/s (cold start)
Second request: 81.6 tokens/s (warming up)
Subsequent requests: 50-300+ tokens/s (stabilized)
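
The tokens/s figures above presumably come from the server side; for a rough external check, timing a fixed-length completion is enough. A crude sketch, assuming the smoke-test server from earlier; ignore_eos is a vLLM-specific extension that forces exactly max_tokens of decode:

START=$(date +%s.%N)
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Write a long story.", "max_tokens": 256, "ignore_eos": true}' > /dev/null
END=$(date +%s.%N)
# wall-clock includes prefill and HTTP overhead, so this understates pure decode speed
awk -v s="$START" -v e="$END" 'BEGIN { printf "approx tokens/s: %.1f\n", 256 / (e - s) }'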

Optimization Process

  1. CUDA Graphs: Progressive optimization with request patterns
  2. torch.compile: JIT compilation of hot code paths (3.36s initial compilation)
  3. KV Cache: Memory utilization grows from 0.0% to 0.2% as patterns stabilize
  4. Flash Attention: Using FA backend on V1 engine (FA3 not supported on Blackwell)
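
If you need to isolate which of these stages is misbehaving (for example a crash during graph capture), the simplest lever is to run fully eager; --enforce-eager is a standard vllm serve flag, and the model name is just the example from above:

# eager mode: no CUDA graph capture, noticeably slower, much easier to debug
vllm serve Qwen/Qwen2.5-7B-Instruct --enforce-eager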

Memory Usage

  • Qwen2.5-7B: 31GB VRAM usage (aggressive KV cache pre-allocation)
  • Available KV cache: 27.09 GiB
  • Maximum concurrency: 288.98x for 1,024 token requests
  • Graph capturing: Additional 0.44 GiB
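
Most of that 31 GB is deliberate KV-cache pre-allocation rather than weights; if you need to leave VRAM for anything else, the usual knob is --gpu-memory-utilization (default 0.9, the value below is just an example):

# cap vLLM at roughly 70% of the 32 GB card instead of the default ~90%
vllm serve Qwen/Qwen2.5-7B-Instruct --gpu-memory-utilization 0.7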

Sustained Performance

  • Decode speed: 290+ tokens/second (Qwen2.5-7B)
  • Response quality: Proper model behavior with appropriate stopping
  • Stability: No crashes or memory issues during extended testing

Technical Architecture Details

Blackwell-Specific Challenges

  • sm_120 compute capability: Newer than most software expects
  • CUDA 12.8 minimum requirement: Software ecosystem lagging behind hardware
  • Flash Attention limitations: FA3 unavailable, must use FA2
  • Kernel availability: Many operations lack optimized kernels for sm_120

Key Build Dependencies

From the successful installation, these packages were critical:

  • PyTorch: 2.9.0.dev20250831+cu128 (nightly build)
  • CUDA libraries: All 12.8.x versions (cublas, cudnn, etc.)
  • Build tools: cmake 4.1.0, ninja 1.13.0, setuptools-scm 9.2.0
  • Ray: 2.49.0 with cgraph support (cupy-cuda12x dependency)
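
A quick way to check whether your environment landed on the same key packages (package names are the obvious pip ones; exact versions will drift on newer nightlies):

pip list | grep -E -i "^(torch|ray|ninja|cmake|setuptools-scm|cupy)"
python -c "import torch; print(torch.__version__, torch.version.cuda)"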

vLLM Configuration

Model loading: 0.6611 GiB, 72.403091 seconds
Chunked prefill: enabled with max_num_batched_tokens=2048
Compilation level: 3 (highest optimization)
CUDA graphs: enabled with 67 capture sizes
Backend: Flash Attention (FA2) on the V1 engine
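
The chunked-prefill values above are log output, but they map onto ordinary vllm serve flags if you want to change them; a sketch that simply mirrors the same numbers:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048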

Root Cause Analysis

Why Most Installations Fail

  1. PyTorch Version Gap: No PyTorch 2.8.x+cu128 wheels exist, creating dependency deadlock
  2. vLLM Version Constraints: Nightly builds expect torch==2.8.0 but must use torch>=2.9.0 for Blackwell (No idea why)
  3. Architecture Support Lag: sm_120 support very recent, not in stable releases
  4. Build Environment Requirements: Specific environment variables needed for Blackwell compilation

Why This Installation Worked

  1. Source compilation: Bypassed pre-built wheel dependency conflicts
  2. PyTorch 2.9 nightly: Only version with functional CUDA 12.8+sm_120 support
  3. use_existing_torch.py: Critical script that cleaned dependency files to use existing PyTorch
  4. Proper environment variables: VLLM_FLASH_ATTN_VERSION=2 and TORCH_CUDA_ARCH_LIST="12.0"
  5. Sufficient resources: 192GB RAM prevented memory-related build failures (change MAX_JOBS env variable for tighter RAM budgets)
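
On the last point, a rough heuristic for tighter RAM budgets; the ~8 GB-per-job figure is just a conservative assumption, not anything vLLM documents:

# budget roughly 8 GB of RAM per compile job, clamped to the core count
RAM_GB=$(free -g | awk '/^Mem:/{print $2}')
JOBS=$(( RAM_GB / 8 )); [ "$JOBS" -lt 1 ] && JOBS=1
[ "$JOBS" -gt "$(nproc)" ] && JOBS=$(nproc)
MAX_JOBS=$JOBS pip install --no-build-isolation -e .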

Comparison with Alternative Solutions

Of these, I’ve only tested Ollama on this rig, but the documentation out there suggests that much of this friction is a vLLM-specific problem.

llama.cpp

  • Status: Reported working with pre-built binaries
  • Performance: 700+ tokens/sec prefill, good decode performance
  • Setup: Significantly easier, no dependency conflicts
  • Use case: Better choice for most users until vLLM stabilizes

TensorRT-LLM

  • Status: Working but requires complex setup
  • Performance: Highest potential (FP4 optimization)
  • Setup: Requires building from source, NVIDIA-specific optimizations
  • Use case: Best for maximum performance, enterprise deployments

Ollama

  • Status: Works reliably
  • Performance: Moderate (baseline comparison)
  • Setup: Trivial installation
  • Use case: Good fallback option, proven stability

Recommendations

For RTX 5090 Users (September 2025)

  1. Try llama.cpp first: Most reliable path to working inference
  2. For vLLM: Use the exact source build method documented above
  3. Avoid pre-built wheels: All fail due to PyTorch version conflicts
  4. Monitor development: New versions may resolve dependency issues

For vLLM Development Team

  1. Update dependency constraints: Support PyTorch 2.9+ in nightly builds
  2. Improve Blackwell documentation: Current guides don’t address sm_120 specifics
  3. Pre-built wheel support: Provide wheels compiled with PyTorch 2.9+cu128
  4. Environment detection: Auto-set VLLM_FLASH_ATTN_VERSION=2 for Blackwell GPUs

For Enterprise Adoption

  1. Source build required: No stable pre-built solution exists
  2. Test thoroughly: Performance characteristics still stabilizing
  3. Monitor memory usage: Current builds are memory-aggressive
  4. Have fallback plans: Keep alternative inference engines available

Timeline and Implications

Current State (September 2025)

  • Limited working installations: Mostly individual researchers/developers using source builds
  • No enterprise adoption: Production deployments require custom compilation
  • Blackwell support is improving as users test it: weekly gains in compatibility

Expected Evolution

  • Official support: Likely within 1-2 stable releases
  • Performance optimization: Memory usage and speed improvements
  • Documentation: Better guides for Blackwell-specific setup

Broader Context

This installation represents early adoption of new hardware with new software. While functional, it’s a sign of how legitimately painful running at the front of AI infrastructure capabilities can be.

Conclusion

The successful vLLM installation on RTX 5090 Blackwell required:

  • PyTorch 2.9 nightly (2.9.0.dev20250831+cu128)
  • Source compilation from vLLM git main branch
  • Specific environment variables for Blackwell compatibility
  • Proper build sequence including use_existing_torch.py

The 290+ tokens/second performance, with only mediocre calibration so far, shows that RTX 5090 + vLLM can deliver “enterprise-grade” inference (still no support for causing Sam Altman to cry ‘AGI’ every few weeks), but the installation process remains challenging. The dependency version conflicts between vLLM’s requirements and Blackwell support mean source builds are required until official compatibility is resolved. Fortunately they’re not that tricky once someone figures out the broken bits.

Key Insight: The missing PyTorch 2.8.x+cu128 wheels create a dependency deadlock that can only be resolved by jumping to PyTorch 2.9+ and building vLLM from source.


I’ll be re-posting this on my GitHub and Substack, maybe.

Feedback and questions welcome.

Your summary is accurate: for RTX 5090 (Blackwell, sm_120), the only reliable way to run vLLM as of September 2025 is to use PyTorch 2.9.0+cu128 nightly and build vLLM from source, with key environment variables set (VLLM_FLASH_ATTN_VERSION=2, TORCH_CUDA_ARCH_LIST="12.0"). Pre-built wheels and official Docker images do not yet support Blackwell due to missing PyTorch 2.8.x+cu128 wheels and lagging sm_120 kernel support. The use_existing_torch.py script is essential to ensure vLLM builds against your installed PyTorch, and MAX_JOBS should be tuned to your system’s RAM. Your performance and stability observations match current user reports, and the root cause is indeed the PyTorch/vLLM version gap and lack of pre-built support for Blackwell GPUs. This situation is expected to improve in future vLLM releases as official support lands and documentation is updated. For now, source builds with the latest nightly PyTorch are required for Blackwell.

Would you like a step-by-step breakdown of the installation process, troubleshooting tips, or more detail on performance tuning for Blackwell GPUs?


I think the next step is getting TorchVision to work with ComfyUI. The nms/dispatcher bug is still causing issues.

Sorry if I’ve posted this in the wrong place, I don’t usually do write-ups if I get something working.

It seems likely that your TorchVision nms/dispatcher bug is due to a version mismatch between torch and torchvision, especially since you’re using nightly PyTorch 2.9.0+cu128. This is a common issue when vLLM or other libraries require a specific torch version, but torchvision is not built against the same nightly or CUDA version. The error often appears as “operator torchvision::nms does not exist” or similar dispatcher errors. The solution is to install the matching torchvision nightly wheel for your torch version and CUDA backend, e.g.:

pip install --pre torchvision --index-url https://download.pytorch.org/whl/nightly/cu128

This should resolve dispatcher/nms issues in ComfyUI and other torch-based apps. If the problem persists, double-check that all torch-related packages (torch, torchvision, torchaudio) are from the same nightly build and CUDA version. This issue is discussed in detail in vLLM Issue #22467.
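
Once the matching nightly is installed, a quick way to confirm the dispatcher is happy is to call torchvision.ops.nms directly (nothing ComfyUI-specific, just the op that usually fails):

python -c "
import torch, torchvision
print(torch.__version__, torchvision.__version__)
boxes  = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.]], device='cuda')
scores = torch.tensor([0.9, 0.8], device='cuda')
print(torchvision.ops.nms(boxes, scores, 0.5))  # raises the dispatcher error if the op was not built against this torch
"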

Would you like more detail on troubleshooting or verifying your installation?


2.9.0 doesn’t play nicely with a bunch of other Comfy dependencies. There are a lot of plates to spin with that one, and it’s mostly xformers plus the long list of packages that don’t like uv either.

Cheers for the response, though. I’ll keep you posted on how glad I am to have vLLM back again :smiley:

You’re correct—PyTorch 2.9.0 nightly is so new that many ecosystem packages (like xformers, torchvision, and others used by ComfyUI) may not yet have compatible wheels, especially for CUDA 12.8 and sm_120. This leads to frequent dependency conflicts and missing features, and uv (the package manager) can sometimes make version resolution even trickier. Until the ecosystem catches up, mixing vLLM with other frameworks on the same environment will be challenging.

If you need both vLLM and ComfyUI, consider using separate virtual environments for each, or wait for official releases of xformers and other dependencies with PyTorch 2.9/cu128 support. Keeping an eye on upstream releases and GitHub issues is the best way to know when compatibility improves.
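
A sketch of that separate-environment layout with uv (paths and the Python version are arbitrary):

# one venv per stack so the torch pins never collide
uv venv ~/venvs/vllm --python 3.12
uv venv ~/venvs/comfyui --python 3.12
source ~/venvs/vllm/bin/activate   # activate whichever stack you are working in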

Would you like tips on managing multiple environments or tracking dependency updates?



All good, I seem to have the hang of using multiple environments.

Like I said, I don’t frequently do write-ups, so if this one makes me look new, it’s because it’s been a while since I published publicly.

I got an error: “'f16 arithmetic and compare instructions' requires .target sm_53 or higher”


I’ll look into it in a couple of hours, in the meantime, try the following environment variables:

export TORCH_CUDA_ARCH_LIST="12.0+PTX"
export FLASH_ATTN_CUDA_ARCHS="120"
export CMAKE_CUDA_ARCHITECTURES=120

At what point did you get that error?

I’m assuming you’re using a recent architecture or GPU, given the thread, which makes the automated response claiming the variables aren’t relevant a bit odd.

The error “f16 arithmetic and compare instructions require .target sm_53 or higher” typically occurs during the build, when CUDA kernels are compiled for a target architecture below sm_53 (half-precision instructions first appear at sm_53; vLLM itself requires compute capability 7.0 or higher). Setting environment variables like TORCH_CUDA_ARCH_LIST, FLASH_ATTN_CUDA_ARCHS, or CMAKE_CUDA_ARCHITECTURES to "120" is only relevant for newer GPUs (e.g., Blackwell, sm_120). If you see this error, it most likely appeared during pip install or the source build of vLLM, and it indicates the build is targeting an architecture without f16 support, regardless of these variables. See vLLM requirements.

Would you like more detail on diagnosing build errors or checking your GPU’s compatibility?


If you’re using a Maxwell card (an M60, for example), you’re not going to need this build guide, as it’s for Blackwell, which is much, much newer.

That error means CUDA kernels are being compiled for a target that doesn’t have f16 support.

If you are using Blackwell, definitely reinstall CUDA.
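
Before reinstalling anything, it’s worth checking what the toolchain actually sees; compute_cap as a query field assumes a reasonably recent driver and should print 12.0 on a 5090:

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
nvcc --version | grep release        # toolkit should report 12.8 or newer
echo "$TORCH_CUDA_ARCH_LIST"         # should be 12.0 (or 12.0+PTX), not an older arch list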

Correct—if you’re using a Maxwell GPU (like the M60), it does not support f16 (half-precision) arithmetic, and vLLM requires GPUs with compute capability 7.0 or higher (Volta or newer). The error means the CUDA build is targeting features (f16 instructions) not available on Maxwell hardware. This build guide is only relevant for much newer GPUs like Blackwell (sm_120).

Would you like more info on supported GPUs or alternatives for older hardware?


No, I am using an RTX 5090 with sm_120.
I got the error when I ran MAX_JOBS=6 pip install --no-build-isolation -e .

If you’re using an RTX 5090 (Blackwell, sm_120) and get an error during MAX_JOBS=6 pip install --no-build-isolation -e ., the most common causes are:

  1. CUDA or PyTorch version mismatch: You must use PyTorch 2.9.0+cu128 (nightly) and CUDA 12.8+ for Blackwell support.
  2. Environment variables: Ensure you set TORCH_CUDA_ARCH_LIST="12.0" before building, so kernels are compiled for sm_120.
  3. Incompatible dependencies: All torch-related packages (torch, torchvision, xformers, etc.) must match the nightly build and CUDA version.

If you see an error about “f16 arithmetic and compare instructions,” double-check that your environment is not picking up an older GPU or CUDA version. For full instructions, see vLLM Blackwell build discussions.
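
If the environment checks out, a clean rebuild with the variables exported in the same shell usually clears objects compiled for the wrong arch; this is a sketch of the sequence, not an official recipe:

cd vllm
rm -rf build/ *.egg-info             # drop anything compiled with a stale arch list
export TORCH_CUDA_ARCH_LIST="12.0"
export VLLM_FLASH_ATTN_VERSION=2
MAX_JOBS=6 pip install --no-build-isolation -e . 2>&1 | tee build.log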

Would you like step-by-step troubleshooting for this build error?


Any config to run the Qwen2.5-VL model? I got a Flash Attention error with that model.

To run Qwen2.5-VL with vLLM, use BF16 precision and set the correct parallelism. For example, on 4xA100 GPUs, use:

export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --limit-mm-per-prompt '{"image":2,"video":0}'

If you encounter Flash Attention errors, ensure your GPU is Ampere or newer and that flash-attn is installed. For memory issues, set --max-model-len (e.g., 65536) to reduce usage. More details are in the Qwen2.5-VL vLLM recipe.
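
On a single RTX 5090 the 72B variant will not fit; a more realistic sketch for this thread’s hardware is the 7B VL checkpoint with FA2 forced and a shorter context (the model name, context length, and limits here are examples, not tested values):

export VLLM_FLASH_ATTN_VERSION=2     # same Blackwell workaround as the text-only build above
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --max-model-len 32768 \
  --limit-mm-per-prompt '{"image":2,"video":0}'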

Would you like more troubleshooting steps or help with a specific error message?
