Support for RTX 6000 Blackwell 96GB card

Blackwell (RTX 6000) support in vLLM is still maturing, and the official vLLM Docker images often lack full compatibility with SM120 GPUs out of the box. The weights of an FP8 or int4 quantized 8B model need far less than the 92GB you observe. Note that vLLM preallocates GPU memory for the KV cache up to gpu_memory_utilization (0.9 by default), so a high total is not by itself proof of a problem. If you also see crashes or failed quantization, the likely cause is missing or misconfigured CUDA, PyTorch, or kernel support for Blackwell in the prebuilt images, which forces a fallback to inefficient code paths and can result in full-precision memory usage.

Many users report similar issues and have only succeeded by building vLLM and its dependencies (PyTorch, Triton, FlashAttention, etc.) from source with explicit SM120/Blackwell support and the CUDA 12.8+ stack, because the official images and wheels often do not include the necessary CUDA architectures or kernels. See detailed user reports and workarounds in this issue, this discussion, and this troubleshooting guide.
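
As a rough sanity check on those numbers, weight memory is approximately the parameter count times the bytes per parameter. The sketch below is illustrative only: the helper name weight_memory_gib is hypothetical, and it deliberately ignores KV cache, activations, and CUDA context overhead.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB.

    Assumption: every parameter is stored at the same precision and
    nothing else (KV cache, activations, CUDA context) is counted.
    """
    return n_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    # An 8B-parameter model at three common precisions.
    for name, bits in [("FP16/BF16", 16), ("FP8", 8), ("int4", 4)]:
        print(f"{name:>9}: {weight_memory_gib(8e9, bits):.1f} GiB")
```

Even at 16-bit precision, the weights of an 8B model come to well under 20 GiB, so anything near 92GB is dominated by something other than the weights.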

To get a model running, you should:

  • Build vLLM from source, setting the TORCH_CUDA_ARCH_LIST="12.0 12.1" environment variable during the build to enable SM120 support (compute capability 12.0).
  • Use the latest PyTorch nightly (or build from source) with CUDA 12.8+.
  • Use the latest compatible versions of Triton and FlashAttention.
  • Avoid the official Docker images for now; instead, use a custom Dockerfile or build natively.
  • For FP8 or int4 models, ensure the quantization kernels are supported for Blackwell (some quantization types may still lack kernel support).
  • Community-maintained guides and Dockerfiles, such as this tutorial and this user-contributed Dockerfile, have helped others succeed.
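
Before rebuilding anything, it can help to confirm whether the installed PyTorch build already targets your GPU. The following is a minimal sketch, not an official vLLM diagnostic: the helper names are hypothetical, and the check is deliberately conservative (it looks only for an exact sm_XY match and ignores PTX forward compatibility). It relies on torch.cuda.get_arch_list() and torch.cuda.get_device_capability(), and degrades to a message if PyTorch or a CUDA device is unavailable.

```python
def capability_to_sm(capability):
    """Convert a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_targets_device(arch_list, capability):
    """Return True if arch_list has a binary target exactly matching the device.

    Simplifying assumption: only an exact 'sm_XY' entry counts.
    """
    return capability_to_sm(capability) in arch_list


def report():
    """Print whether the local PyTorch build includes kernels for GPU 0."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch.")
        return
    capability = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print(f"PyTorch built against CUDA: {torch.version.cuda}")
    print(f"Device compute capability:  {capability_to_sm(capability)}")
    print(f"Architectures in the build: {archs}")
    if build_targets_device(archs, capability):
        print("This PyTorch build includes a target for the device.")
    else:
        print("No exact target for the device; a source build may be needed.")


if __name__ == "__main__":
    report()
```

On an RTX 6000 Blackwell card the device tag should be sm_120; if that tag is missing from the architecture list, the build steps above are the likely fix.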

Would you like a step-by-step build guide or a sample Dockerfile for Blackwell?
