vLLM on RTX 5090: Working GPU setup with torch 2.9.0 cu128

vLLM on RTX 5090 Blackwell: A Technical Deep Dive

Hi folks, I spent most of today trying to get vLLM running with PyTorch 2.9.0 and it looks like the most recent build takes care of a lot of errors.

There are so many ways to get this wrong and I’m amazed it worked at all. I think I hit every issue on this forum on the way to this point. I hope this helps anyone else working through the same issues get things running.

Working Installation Process

System Configuration

  • GPU: NVIDIA GeForce RTX 5090 GB202 (Blackwell sm_120 architecture)
  • Driver: NVIDIA 575.64.03
  • CUDA: 12.8 (automatically supported by driver)
  • OS: Ubuntu 25.04 Plucky Puffin
  • RAM: 192 GB
  • CPU: AMD Ryzen 9 9900X3D (12-core)
  • Python: 3.12.10 (in virtual environment)
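
If you are on a similar rig and want to sanity-check it before starting, something like the following should report the driver, compute capability 12.0, and (once torch is installed) a cu128 build. Note that compute_cap as an nvidia-smi query field assumes a reasonably recent driver:

# GPU, driver, and compute capability as the driver sees them
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
# what PyTorch was built against and what it detects (run inside the venv created below)
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability())"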

The Working Solution

The successful installation used PyTorch 2.9 nightly + vLLM source build:

vllm --version
# Output: 0.10.1rc2.dev413+g5438967fb.d20250901

This is a September 1, 2025 development build compiled from source with PyTorch 2.9.0.dev20250831+cu128.

Installation Steps (Confirmed Working Method)

  1. Environment Setup
uv venv --python 3.12.10 --seed
source .venv/bin/activate
  2. PyTorch 2.9 Nightly Installation
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
# Result: torch==2.9.0.dev20250831+cu128
  3. vLLM Source Build
gh repo clone vllm-project/vllm    # This can just as easily be cloning from ssh
cd vllm
python use_existing_torch.py     # Important: strips the pinned torch requirements so the build uses the installed nightly
pip install -r requirements/build.txt
export VLLM_FLASH_ATTN_VERSION=2
export TORCH_CUDA_ARCH_LIST="12.0"
MAX_JOBS=6 pip install --no-build-isolation -e .
  4. Verification
vllm --version
# Should show: 0.10.1rc2.dev413+g5438967fb.d20250901
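
As an optional last step, a minimal smoke test against the OpenAI-compatible server; the model name is just the Qwen2.5-7B instruct checkpoint discussed later, and port 8000 is the vllm serve default:

# terminal 1: serve a small model
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 8192
# terminal 2: one completion request against the default port
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Hello", "max_tokens": 32}'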

What Doesn’t Work (Confirmed Failures)

Pre-built Wheels from PyPI

  • All official releases: Lack Blackwell sm_120 support
  • Dependency conflicts: Version mismatches with PyTorch requirements
  • Missing wheel combinations: No PyTorch 2.8.x+cu128 wheels exist

Docker Container Approaches

  • All official images: Not ready for Blackwell architecture
  • NVIDIA NGC containers: Contain outdated vLLM versions incompatible with RTX 5090

Failed PyTorch Version Attempts

  • PyTorch 2.7.x: Only supports CUDA 12.6, incompatible with Blackwell sm_120
  • PyTorch 2.8.x: No cu128 wheels available anywhere (404 errors on GitHub releases)
  • Dependency resolution conflicts: vLLM nightly wants torch==2.8.0, but this doesn’t exist with cu128

Environment Variable Requirements

The successful build required specific Blackwell-targeting variables:

VLLM_FLASH_ATTN_VERSION=2  # FA3 unsupported on Blackwell
TORCH_CUDA_ARCH_LIST="12.0"  # Blackwell sm_120 architecture
MAX_JOBS=6  # Memory management during compilation

Performance Analysis

Startup and Warmup Behavior

Initial load: ~72 seconds (model download and compilation)
First request: 19.9 tokens/s (cold start)
Second request: 81.6 tokens/s (warming up)
Subsequent requests: 50-300+ tokens/s (stabilized)
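
The tokens/s figures above presumably come from the server side; for a rough external check, timing a fixed-length completion is enough. A crude sketch, assuming the smoke-test server from earlier; ignore_eos is a vLLM-specific extension that forces exactly max_tokens of decode:

START=$(date +%s.%N)
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Write a long story.", "max_tokens": 256, "ignore_eos": true}' > /dev/null
END=$(date +%s.%N)
# wall-clock includes prefill and HTTP overhead, so this understates pure decode speed
awk -v s="$START" -v e="$END" 'BEGIN { printf "approx tokens/s: %.1f\n", 256 / (e - s) }'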

Optimization Process

  1. CUDA Graphs: Progressive optimization with request patterns
  2. torch.compile: JIT compilation of hot code paths (3.36s initial compilation)
  3. KV Cache: Memory utilization grows from 0.0% to 0.2% as patterns stabilize
  4. Flash Attention: Using FA backend on V1 engine (FA3 not supported on Blackwell)
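
If you need to isolate which of these stages is misbehaving (for example a crash during graph capture), the simplest lever is to run fully eager; --enforce-eager is a standard vllm serve flag, and the model name is just the example from above:

# eager mode: no CUDA graph capture, noticeably slower, much easier to debug
vllm serve Qwen/Qwen2.5-7B-Instruct --enforce-eager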

Memory Usage

  • Qwen2.5-7B: 31GB VRAM usage (aggressive KV cache pre-allocation)
  • Available KV cache: 27.09 GiB
  • Maximum concurrency: 288.98x for 1,024 token requests
  • Graph capturing: Additional 0.44 GiB
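
Most of that 31 GB is deliberate KV-cache pre-allocation rather than weights; if you need to leave VRAM for anything else, the usual knob is --gpu-memory-utilization (default 0.9, the value below is just an example):

# cap vLLM at roughly 70% of the 32 GB card instead of the default ~90%
vllm serve Qwen/Qwen2.5-7B-Instruct --gpu-memory-utilization 0.7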

Sustained Performance

  • Decode speed: 290+ tokens/second (Qwen2.5-7B)
  • Response quality: Proper model behavior with appropriate stopping
  • Stability: No crashes or memory issues during extended testing

Technical Architecture Details

Blackwell-Specific Challenges

  • sm_120 compute capability: Newer than most software expects
  • CUDA 12.8 minimum requirement: Software ecosystem lagging behind hardware
  • Flash Attention limitations: FA3 unavailable, must use FA2
  • Kernel availability: Many operations lack optimized kernels for sm_120

Key Build Dependencies

From the successful installation, these packages were critical:

  • PyTorch: 2.9.0.dev20250831+cu128 (nightly build)
  • CUDA libraries: All 12.8.x versions (cublas, cudnn, etc.)
  • Build tools: cmake 4.1.0, ninja 1.13.0, setuptools-scm 9.2.0
  • Ray: 2.49.0 with cgraph support (cupy-cuda12x dependency)
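
A quick way to check whether your environment landed on the same key packages (package names are the obvious pip ones; exact versions will drift on newer nightlies):

pip list | grep -E -i "^(torch|ray|ninja|cmake|setuptools-scm|cupy)"
python -c "import torch; print(torch.__version__, torch.version.cuda)"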

vLLM Configuration

Model loading: 0.6611 GiB, 72.403091 seconds
Chunked prefill: enabled with max_num_batched_tokens=2048
Compilation level: 3 (highest optimization)
CUDA graphs: enabled with 67 capture sizes
Backend: Flash Attention (FA2) on the V1 engine
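
The chunked-prefill values above are log output, but they map onto ordinary vllm serve flags if you want to change them; a sketch that simply mirrors the same numbers:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048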

Root Cause Analysis

Why Most Installations Fail

  1. PyTorch Version Gap: No PyTorch 2.8.x+cu128 wheels exist, creating dependency deadlock
  2. vLLM Version Constraints: Nightly builds expect torch==2.8.0 but must use torch>=2.9.0 for Blackwell (No idea why)
  3. Architecture Support Lag: sm_120 support very recent, not in stable releases
  4. Build Environment Requirements: Specific environment variables needed for Blackwell compilation

Why This Installation Worked

  1. Source compilation: Bypassed pre-built wheel dependency conflicts
  2. PyTorch 2.9 nightly: Only version with functional CUDA 12.8+sm_120 support
  3. use_existing_torch.py: Critical script that cleaned dependency files to use existing PyTorch
  4. Proper environment variables: VLLM_FLASH_ATTN_VERSION=2 and TORCH_CUDA_ARCH_LIST="12.0"
  5. Sufficient resources: 192GB RAM prevented memory-related build failures (change MAX_JOBS env variable for tighter RAM budgets)
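
On the last point, a rough heuristic for tighter RAM budgets; the ~8 GB-per-job figure is just a conservative assumption, not anything vLLM documents:

# budget roughly 8 GB of RAM per compile job, clamped to the core count
RAM_GB=$(free -g | awk '/^Mem:/{print $2}')
JOBS=$(( RAM_GB / 8 )); [ "$JOBS" -lt 1 ] && JOBS=1
[ "$JOBS" -gt "$(nproc)" ] && JOBS=$(nproc)
MAX_JOBS=$JOBS pip install --no-build-isolation -e .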

Comparison with Alternative Solutions

Of these, I’ve only tested Ollama on this rig, but the documentation out there suggests that much of this friction is a vLLM-specific problem.

llama.cpp

  • Status: Reported working with pre-built binaries
  • Performance: 700+ tokens/sec prefill, good decode performance
  • Setup: Significantly easier, no dependency conflicts
  • Use case: Better choice for most users until vLLM stabilizes

TensorRT-LLM

  • Status: Working but requires complex setup
  • Performance: Highest potential (FP4 optimization)
  • Setup: Requires building from source, NVIDIA-specific optimizations
  • Use case: Best for maximum performance, enterprise deployments

Ollama

  • Status: Works reliably
  • Performance: Moderate (baseline comparison)
  • Setup: Trivial installation
  • Use case: Good fallback option, proven stability

Recommendations

For RTX 5090 Users (September 2025)

  1. Try llama.cpp first: Most reliable path to working inference
  2. For vLLM: Use the exact source build method documented above
  3. Avoid pre-built wheels: All fail due to PyTorch version conflicts
  4. Monitor development: New versions may resolve dependency issues

For vLLM Development Team

  1. Update dependency constraints: Support PyTorch 2.9+ in nightly builds
  2. Improve Blackwell documentation: Current guides don’t address sm_120 specifics
  3. Pre-built wheel support: Provide wheels compiled with PyTorch 2.9+cu128
  4. Environment detection: Auto-set VLLM_FLASH_ATTN_VERSION=2 for Blackwell GPUs

For Enterprise Adoption

  1. Source build required: No stable pre-built solution exists
  2. Test thoroughly: Performance characteristics still stabilizing
  3. Monitor memory usage: Current builds are memory-aggressive
  4. Have fallback plans: Keep alternative inference engines available

Timeline and Implications

Current State (September 2025)

  • Limited working installations: Mostly individual researchers/developers using source builds
  • No enterprise adoption: Production deployments require custom compilation
  • Blackwell support is improving as users test it: weekly gains in compatibility

Expected Evolution

  • Official support: Likely within 1-2 stable releases
  • Performance optimization: Memory usage and speed improvements
  • Documentation: Better guides for Blackwell-specific setup

Broader Context

This installation represents early adoption of new hardware with new software. While functional, it’s a sign of how legitimately painful running at the front of AI infrastructure capabilities can be.

Conclusion

The successful vLLM installation on RTX 5090 Blackwell required:

  • PyTorch 2.9 nightly (2.9.0.dev20250831+cu128)
  • Source compilation from vLLM git main branch
  • Specific environment variables for Blackwell compatibility
  • Proper build sequence including use_existing_torch.py

The 290+ tokens/second performance, with only mediocre calibration so far, shows that RTX 5090 + vLLM can deliver “enterprise-grade” inference (still no support for causing Sam Altman to cry ‘AGI’ every few weeks), but the installation process remains challenging. The dependency version conflicts between vLLM’s requirements and Blackwell support mean source builds are required until official compatibility is resolved. Fortunately they’re not that tricky once someone figures out the broken bits.

Key Insight: The missing PyTorch 2.8.x+cu128 wheels create a dependency deadlock that can only be resolved by jumping to PyTorch 2.9+ and building vLLM from source.


I’ll be re-posting this on my GitHub and Substack, maybe.

Feedback and questions welcome.

Your summary is accurate: for RTX 5090 (Blackwell, sm_120), the only reliable way to run vLLM as of September 2025 is to use PyTorch 2.9.0+cu128 nightly and build vLLM from source, with key environment variables set (VLLM_FLASH_ATTN_VERSION=2, TORCH_CUDA_ARCH_LIST="12.0"). Pre-built wheels and official Docker images do not yet support Blackwell due to missing PyTorch 2.8.x+cu128 wheels and lagging sm_120 kernel support. The use_existing_torch.py script is essential to ensure vLLM builds against your installed PyTorch, and MAX_JOBS should be tuned to your system’s RAM. Your performance and stability observations match current user reports, and the root cause is indeed the PyTorch/vLLM version gap and lack of pre-built support for Blackwell GPUs. This situation is expected to improve in future vLLM releases as official support lands and documentation is updated. For now, source builds with the latest nightly PyTorch are required for Blackwell.

Would you like a step-by-step breakdown of the installation process, troubleshooting tips, or more detail on performance tuning for Blackwell GPUs?


I think the next step is getting TorchVision to work with ComfyUI. The nms/dispatcher bug is still causing issues.

Sorry if I’ve posted this in the wrong place, I don’t usually do write-ups if I get something working.

It seems likely that your TorchVision nms/dispatcher bug is due to a version mismatch between torch and torchvision, especially since you’re using nightly PyTorch 2.9.0+cu128. This is a common issue when vLLM or other libraries require a specific torch version, but torchvision is not built against the same nightly or CUDA version. The error often appears as “operator torchvision::nms does not exist” or similar dispatcher errors. The solution is to install the matching torchvision nightly wheel for your torch version and CUDA backend, e.g.:

pip install --pre torchvision --index-url https://download.pytorch.org/whl/nightly/cu128

This should resolve dispatcher/nms issues in ComfyUI and other torch-based apps. If the problem persists, double-check that all torch-related packages (torch, torchvision, torchaudio) are from the same nightly build and CUDA version. This issue is discussed in detail in vLLM Issue #22467.
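
Once the matching nightly is installed, a quick way to confirm the dispatcher is happy is to call torchvision.ops.nms directly (nothing ComfyUI-specific, just the op that usually fails):

python -c "
import torch, torchvision
print(torch.__version__, torchvision.__version__)
boxes  = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.]], device='cuda')
scores = torch.tensor([0.9, 0.8], device='cuda')
print(torchvision.ops.nms(boxes, scores, 0.5))  # raises the dispatcher error if the op was not built against this torch
"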

Would you like more detail on troubleshooting or verifying your installation?


2.9.0 doesn’t play nicely with a bunch of other Comfy dependencies. There are a lot of plates to spin with that one, and it’s mostly xformers plus the long list of packages that don’t like uv either.

Cheers for the response, though. I’ll keep you posted on how glad I am to have vLLM back again :smiley:

You’re correct—PyTorch 2.9.0 nightly is so new that many ecosystem packages (like xformers, torchvision, and others used by ComfyUI) may not yet have compatible wheels, especially for CUDA 12.8 and sm_120. This leads to frequent dependency conflicts and missing features, and uv (the package manager) can sometimes make version resolution even trickier. Until the ecosystem catches up, mixing vLLM with other frameworks on the same environment will be challenging.

If you need both vLLM and ComfyUI, consider using separate virtual environments for each, or wait for official releases of xformers and other dependencies with PyTorch 2.9/cu128 support. Keeping an eye on upstream releases and GitHub issues is the best way to know when compatibility improves.
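
A sketch of that separate-environment layout with uv (paths and the Python version are arbitrary):

# one venv per stack so the torch pins never collide
uv venv ~/venvs/vllm --python 3.12
uv venv ~/venvs/comfyui --python 3.12
source ~/venvs/vllm/bin/activate   # activate whichever stack you are working in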

Would you like tips on managing multiple environments or tracking dependency updates?



All good, I seem to have the hang of using multiple environments.

Like I said, I don’t frequently do write-ups, so if this one makes me look new, it’s because it’s been a while since I published publicly.

I got an error: “'f16 arithmetic and compare instructions' requires .target sm_53 or higher”


I’ll look into it in a couple of hours, in the meantime, try the following environment variables:

export TORCH_CUDA_ARCH_LIST="12.0+PTX"
export FLASH_ATTN_CUDA_ARCHS="120"
export CMAKE_CUDA_ARCHITECTURES=120

At what point did you get that error?

I’m assuming you’re using a recent architecture or GPU, given the thread, which makes the automated response claiming the variables aren’t relevant a bit odd.

The error “f16 arithmetic and compare instructions require .target sm_53 or higher” typically occurs during the build, when CUDA kernels are compiled for a target architecture below sm_53 (half-precision instructions first appear at sm_53; vLLM itself requires compute capability 7.0 or higher). Setting environment variables like TORCH_CUDA_ARCH_LIST, FLASH_ATTN_CUDA_ARCHS, or CMAKE_CUDA_ARCHITECTURES to "120" is only relevant for newer GPUs (e.g., Blackwell, sm_120). If you see this error, it most likely appeared during pip install or the source build of vLLM, and it indicates the build is targeting an architecture without f16 support, regardless of these variables. See vLLM requirements.

Would you like more detail on diagnosing build errors or checking your GPU’s compatibility?


If you’re using a Maxwell card (an M60, for example), you’re not going to need this build guide, as it’s for Blackwell, which is much, much newer.

That error means CUDA kernels are being compiled for a target that doesn’t have f16 support.

If you are using Blackwell, definitely reinstall CUDA.
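
Before reinstalling anything, it’s worth checking what the toolchain actually sees; compute_cap as a query field assumes a reasonably recent driver and should print 12.0 on a 5090:

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
nvcc --version | grep release        # toolkit should report 12.8 or newer
echo "$TORCH_CUDA_ARCH_LIST"         # should be 12.0 (or 12.0+PTX), not an older arch list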

Correct—if you’re using a Maxwell GPU (like the M60), it does not support f16 (half-precision) arithmetic, and vLLM requires GPUs with compute capability 7.0 or higher (Volta or newer). The error means the CUDA build is targeting features (f16 instructions) not available on Maxwell hardware. This build guide is only relevant for much newer GPUs like Blackwell (sm_120).

Would you like more info on supported GPUs or alternatives for older hardware?


No, I am using an RTX 5090 with sm_120.
I got the error when I ran MAX_JOBS=6 pip install --no-build-isolation -e .

If you’re using an RTX 5090 (Blackwell, sm_120) and get an error during MAX_JOBS=6 pip install --no-build-isolation -e ., the most common causes are:

  1. CUDA or PyTorch version mismatch: You must use PyTorch 2.9.0+cu128 (nightly) and CUDA 12.8+ for Blackwell support.
  2. Environment variables: Ensure you set TORCH_CUDA_ARCH_LIST="12.0" before building, so kernels are compiled for sm_120.
  3. Incompatible dependencies: All torch-related packages (torch, torchvision, xformers, etc.) must match the nightly build and CUDA version.

If you see an error about “f16 arithmetic and compare instructions,” double-check that your environment is not picking up an older GPU or CUDA version. For full instructions, see vLLM Blackwell build discussions.
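
If the environment checks out, a clean rebuild with the variables exported in the same shell usually clears objects compiled for the wrong arch; this is a sketch of the sequence, not an official recipe:

cd vllm
rm -rf build/ *.egg-info             # drop anything compiled with a stale arch list
export TORCH_CUDA_ARCH_LIST="12.0"
export VLLM_FLASH_ATTN_VERSION=2
MAX_JOBS=6 pip install --no-build-isolation -e . 2>&1 | tee build.log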

Would you like step-by-step troubleshooting for this build error?


Any config to run the Qwen2.5-VL model? I got a Flash Attention error with that model.

To run Qwen2.5-VL with vLLM, use BF16 precision and set the correct parallelism. For example, on 4xA100 GPUs, use:

export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --limit-mm-per-prompt '{"image":2,"video":0}'

If you encounter Flash Attention errors, ensure your GPU is Ampere or newer and that flash-attn is installed. For memory issues, set --max-model-len (e.g., 65536) to reduce usage. More details are in the Qwen2.5-VL vLLM recipe.
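
On a single RTX 5090 the 72B variant will not fit; a more realistic sketch for this thread’s hardware is the 7B VL checkpoint with FA2 forced and a shorter context (the model name, context length, and limits here are examples, not tested values):

export VLLM_FLASH_ATTN_VERSION=2     # same Blackwell workaround as the text-only build above
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --max-model-len 32768 \
  --limit-mm-per-prompt '{"image":2,"video":0}'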

Would you like more troubleshooting steps or help with a specific error message?
