Support for RTX 6000 Blackwell 96GB card

RunLLM · October 7, 2025, 11:47pm

Blackwell (RTX 6000) support in vLLM is still maturing, and the official vLLM Docker images often lack full compatibility with SM120 GPUs out of the box. The extremely high VRAM usage you observe (e.g., 92GB for an 8B FP8 model) is not expected: FP8 or int4 quantized 8B models should use far less memory. The likely cause is missing or misconfigured CUDA, PyTorch, or kernel support for Blackwell in the prebuilt images, which forces a fallback to inefficient code paths or makes quantization fail, resulting in full-precision memory usage or outright crashes. Many users report similar issues and have succeeded only by building vLLM and its dependencies (PyTorch, Triton, FlashAttention, etc.) from source with explicit SM120/Blackwell support and CUDA 12.8 or newer, because the official images and wheels often omit the necessary CUDA architectures and kernels. See detailed user reports and workarounds in this issue, this discussion, and this troubleshooting guide.
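As a rough sanity check on the "far less memory" point, here is a minimal sketch of the weight-only arithmetic. The helper name `weight_memory_gb` and the bytes-per-parameter table are illustrative assumptions, not part of vLLM, and the estimate ignores KV cache, activations, and runtime overhead.

```python
# Weight-only estimate (assumption: every parameter is stored at the stated precision).
# Ignores KV cache, activations, CUDA context, and any quantization scale tensors.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("fp16", "fp8", "int4"):
        size = weight_memory_gb(8e9, dtype)
        print(f"8B parameters at {dtype}: ~{size:.0f} GB of weights")
```

Under these assumptions an 8B model needs roughly 8 GB of weights at FP8 and 4 GB at int4. Note that vLLM also reserves GPU memory for the KV cache up front (governed by its `gpu_memory_utilization` setting), so the total reported by `nvidia-smi` is not a direct measure of weight size.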

To get a model running, you should:

Build vLLM from source, setting the TORCH_CUDA_ARCH_LIST environment variable to "12.0 12.1" during the build to enable SM120 support.
Use the latest PyTorch nightly (or build from source) with CUDA 12.8+.
Use the latest compatible versions of Triton and FlashAttention.
Avoid the official Docker images for now; instead, use a custom Dockerfile or build natively.
For FP8 or int4 models, ensure the quantization kernels are supported for Blackwell (some quantization types may still lack kernel support).

Community-maintained guides and Dockerfiles, such as this tutorial and this user-contributed Dockerfile, have helped others succeed.
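Before rebuilding anything, it can help to confirm whether the installed PyTorch was actually compiled for the card. The following is a hedged sketch of such a check: the helper names (`cuda_at_least`, `arch_is_compiled`, `check_environment`) are hypothetical, and the 12.8 threshold is taken from the steps above. It relies only on `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_device_capability()`, and `torch.cuda.get_arch_list()`.

```python
def cuda_at_least(version, minimum=(12, 8)):
    """True if a CUDA version string such as '12.8' meets the minimum."""
    if not version:  # torch.version.cuda is None on CPU-only builds
        return False
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= minimum


def arch_is_compiled(arch_list, capability):
    """True if the build's arch list contains native code for this capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_environment():
    """Print what the installed PyTorch build supports on GPU 0."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    print("PyTorch built against CUDA:", torch.version.cuda)
    print("CUDA 12.8 or newer:", cuda_at_least(torch.version.cuda))
    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch.")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("Device compute capability:", capability)
    print("Architectures in this build:", arch_list)
    print("Native kernels for this GPU:", arch_is_compiled(arch_list, capability))
```

Calling `check_environment()` inside the container or virtual environment in question shows whether the build lists `sm_120`. If it does not, that is consistent with the missing-architecture problem described above, although a passing check does not guarantee that every vLLM quantization kernel is available.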

Would you like a step-by-step build guide or a sample Dockerfile for Blackwell?

