Support for RTX 6000 Blackwell 96GB card

ThommyB · October 7, 2025, 11:47pm

I ran a RTX4500 Nvidia Ada Type card with 24 GB vRAM. The memory is small but I could load several 7B or 8B models with FP8 or int4. Since I like to run several LLMs in on card I ordered a RTX 6000 Blackwell with 96 GB vRAM. And I am very disappointed, since I do have only problems. I was not able to get one LLM running!

So I have a Linux box running Rocky v9. It has a Ryzen 9900 CPU, 128 GB RAM and 2 Samsung 990 Pro NVMe SSDs. I use Podman (thats what RHEL uses instead of docker) as container engine. I was using vLLM 0.10.2 from Docker.io and upgraded yesterday to Version 0.11.0.

There is no difference for my problem.

I tried the swiss-ai/apertus 8B model which is offered from RedHat under RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic with FP8 quantization.

When I load it, it takes forever (one minute or more) and seems to fill the entire memory. So an 8B model consumes about 92 GB vRAM! Really???

I tried to tweek around with some parameters, to limit the vRAM consumption to 50%. Then it consumed 50 GB vRAM.

Unfortunately I have no deeper understanding of all the software which is used, I know it is complicated. I am just a user, that develops software ON TOP of such vLLM instances.

So I wonder: Was my CHF 6500 investment for the Blackwell card for nuts when using vLLM? Is it too early for Blackwell?

It worked perfectly well with the Ada card, but 24 GB vRAM is definitely too small, I wanted to test the Apertus 70B model which is also offered from RedHatAI in a w4a16 quantization version. But no luck at all, the standard vLLM docker image seems not to support it. Sometimes I can see it receives some tokens and tries to return some tokens but they do not arrive, crash or just hanging vLLM.

So if somebody could just tell me, if I have any chance to get any model (tell me which one) up and running with my Blackwell card, that would be great. Or if I need to build an image myself with newer, but more experimental software please let me know as well. I now that Blackwell is a new architecture, but I did not expect so many problems…

RunLLM · October 7, 2025, 11:47pm

Blackwell (RTX 6000) support in vLLM is still maturing, and the official vLLM Docker images often lack full compatibility for SM120 GPUs out-of-the-box. The extremely high VRAM usage you observe (e.g., 92GB for an 8B FP8 model) is not expected—FP8 or int4 quantized 8B models should use far less memory. This is likely due to missing or misconfigured CUDA, PyTorch, or kernel support for Blackwell in the prebuilt images, causing fallback to inefficient code paths or failed quantization, which can result in full-precision memory usage or outright crashes. Many users report similar issues and have only succeeded by building vLLM and its dependencies (PyTorch, Triton, FlashAttention, etc.) from source with explicit SM120/Blackwell support and the latest CUDA 12.8+ stack. The official images and wheels often do not include the necessary CUDA architectures or kernel support for Blackwell, leading to the problems you describe. See detailed user reports and workarounds in this issue, this discussion, and this troubleshooting guide.

To get a model running, you should:

Build vLLM from source, ensuring you set torch_cuda_arch_list="12.0 12.1" during the build to enable SM120 support.
Use the latest PyTorch nightly (or build from source) with CUDA 12.8+.
Use the latest compatible versions of Triton and FlashAttention.
Avoid the official Docker images for now; instead, use a custom Dockerfile or build natively.
For FP8 or int4 models, ensure the quantization kernels are supported for Blackwell (some quantization types may still lack kernel support).
Community-maintained guides and Dockerfiles, such as this tutorial and this user-contributed Dockerfile, have helped others succeed.

Would you like a step-by-step build guide or a sample Dockerfile for Blackwell?

Sources:

Hint: Mention @RunLLM in the post for followups.

ThommyB · October 8, 2025, 5:49am

@RunLLM

Yes please provide such a step-by-step build guide and a sample Dockerfile.

While waiting for approval of my post, I got a bit further. I followed a recipe from NVIDIA here. This allowed me to load the nvidia/Llama-3.3-70B-Instruct-FP4 in my GPU, and it consumes around 65 GG vRAM as can be seen here:

nvidia-smi
Wed Oct  8 07:23:48 2025
±----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
±----------------------------------------±-----------------------±---------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac…    Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8              3W /  300W |   65852MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
±----------------------------------------±-----------------------±---------------------+

±----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           91857      C   VLLM::Worker                          65842MiB |
±----------------------------------------------------------------------------------------+

However, the generated text is broken when I send a very simple prompt through curl:

curl http://0.0.0.0:8003/v1/chat/completions   -H “Content-Type: application/json”   -d ‘{
“model”: “nvidia/Llama-3.3-70B-Instruct-FP4”,
“messages”: \[{“role”: “user”, “content”: “Explain the main features of the Blackwell GPU architecture.”}\],
“max_tokens”: 128
}’ | jq
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
100  1309  100  1121  100   188    168     28  0:00:06  0:00:06 --:–:--   252
{
“id”: “chatcmpl-33a760382a4246c49b4a4a86254b671d”,
“object”: “chat.completion”,
“created”: 1759861691,
“model”: “nvidia/Llama-3.3-70B-Instruct-FP4”,
“choices”: \[
{
“index”: 0,
“message”: {
“role”: “assistant”,
“content”: “The the original\\nThe the management of the whole\\nThe the mill\\nThe the end (200\\\\\\\\\\nThe ,  a second\\\\\\\\\\nIt\\\\\\\\\\nThe the parent\\nThe a minute by\\nThe and the national security and exchange.\\nThe, and Maldives\\nThe\\\\\\\\\\nThe . the actual\\nassistant\\n\\nThe\\\\\\\\\\nassistant\\n\\n1. Schmidt et . . (West\\nUnderstand up\\nWhen  to\\n\\nThe a must be\\\\\\\\\\nassistant\\n\\nThe\\\\\\\\\\n##The\\\\\\\\\\nassistant\\n\\nWe have to\\n\\nAs a teacher for, ,  a year  the World Health\\nThe the\\n The legal”,
“refusal”: null,
“annotations”: null,
“audio”: null,
“function_call”: null,
“tool_calls”: [ ],
“reasoning_content”: null
},
“logprobs”: null,
“finish_reason”: “length”,
“stop_reason”: null,
“token_ids”: null
}
\],
“service_tier”: null,
“system_fingerprint”: null,
“usage”: {
“prompt_tokens”: 47,
“total_tokens”: 175,
“completion_tokens”: 128,
“prompt_tokens_details”: null
},
“prompt_logprobs”: null,
“prompt_token_ids”: null,
“kv_transfer_params”: null
}

The build log contains some interesting hints though but most of it is not understandable for me. I paste some parts here for anybody who might be interested:

INFO 10-07 12:03:09 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1) INFO 10-07 12:03:11 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1) INFO 10-07 12:03:11 [utils.py:233] non-default args: {'model': 'nvidia/Llama-3.3-70B-Instruct-FP4', 'tokenizer': '/models/nvidia/Llama-3.3-70B-Instruct-FP4', 'trust_remote_code': True, 'max_model_len': 10240, 'gpu_memory_utilization': 0.85, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': False, 'max_num_batched_tokens': 8192, 'max_num_seqs': 512, 'async_scheduling': True, 'compilation_config': {"level":null,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["+quant_fp8","+rms_norm"],"splitting_ops":[],"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,0],"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fi_allreduce_fusion":true},"max_capture_size":null,"local_cache_dir":null}}
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 10-07 12:03:12 [model.py:547] Resolved architecture: LlamaForCausalLM
(APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1) INFO 10-07 12:03:12 [model.py:1510] Using max model len 10240
(APIServer pid=1) INFO 10-07 12:03:13 [cache.py:180] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 10-07 12:03:14 [arg_utils.py:1293] Defaulting to mp-based distributed executor backend for async scheduling.
(APIServer pid=1) INFO 10-07 12:03:14 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) WARNING 10-07 12:03:14 [modelopt.py:626] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) WARNING 10-07 12:03:14 [compilation.py:598] Using piecewise compilation with empty splitting_ops and use_inductor_graph_partition=False.
INFO 10-07 12:03:16 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=101) INFO 10-07 12:03:17 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=101) INFO 10-07 12:03:17 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='nvidia/Llama-3.3-70B-Instruct-FP4', speculative_config=None, tokenizer='/models/nvidia/Llama-3.3-70B-Instruct-FP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10240, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=nvidia/Llama-3.3-70B-Instruct-FP4, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["+quant_fp8","+rms_norm"],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,0],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fi_allreduce_fusion":true},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=101) WARNING 10-07 12:03:17 [multiproc_executor.py:720] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=101) INFO 10-07 12:03:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_bd917c15'), local_subscribe_addr='ipc:///tmp/15d6c171-f940-4d86-acef-44e17dc1c950', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-07 12:03:18 [__init__.py:216] Automatically detected platform cuda.
W1007 12:03:20.625000 155 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1007 12:03:20.625000 155 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 10-07 12:03:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_b2aa7488'), local_subscribe_addr='ipc:///tmp/3767a4fe-9e5f-4d93-9e2c-0757b9e78e30', remote_subscribe_addr=None, remote_addr_ipv6=False)
...
There are more warnings in the log. If someone likes the whole log for whatever reasons, I can paste it, but its pretty long...

So I tried all this with vLLM docker version 0.11.0.
If I get the Dockerfile, I will try a full compile with a nightly version and see, if it gets better results in receiving answers from a prompt. First I have to read through all the provided links.

RunLLM · October 8, 2025, 5:50am

Here’s a step-by-step guide and a sample Dockerfile for running vLLM on Blackwell (RTX 6000) with proper SM120 support. This approach is based on community-verified solutions and official vLLM guidance for new NVIDIA architectures. Your current issues (broken output, high VRAM, etc.) are common when using prebuilt images that lack full Blackwell/FP4/FP8 kernel support. Building from source with the right CUDA arch flags and latest dependencies is essential.

Step-by-Step Build Guide for vLLM on Blackwell (SM120):

Install Prerequisites on Host
- Make sure you have CUDA 12.8+ and the latest NVIDIA driver (580+).
- Install Podman or Docker (Podman is fine for RHEL/Rocky).
- Install git and python3.12 (or 3.10/3.11).

Clone vLLM Source

git clone https://github.com/vllm-project/vllm.git
cd vllm

Sample Dockerfile for Blackwell (SM120)
Save this as Dockerfile.blackwell in your vLLM directory:

FROM nvcr.io/nvidia/pytorch:25.05-py3

# Set CUDA arch for Blackwell (SM120)
ARG torch_cuda_arch_list="12.0 12.1"
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}

# Install build tools and dependencies
RUN apt-get update && apt-get install -y git cmake ninja-build ccache && rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip setuptools_scm packaging

# Clone vLLM (if not using build context)
# RUN git clone https://github.com/vllm-project/vllm.git /workspace/vllm
WORKDIR /workspace/vllm

# Build vLLM with correct CUDA arch
RUN python3 use_existing_torch.py
RUN pip install -r requirements/build.txt
RUN pip install -r requirements/common.txt
RUN pip install --no-build-isolation -e .

# Optional: Install latest transformers
RUN pip install --upgrade git+https://github.com/huggingface/transformers.git

# Expose API port
EXPOSE 8000

Build the Docker Image

podman build -f Dockerfile.blackwell -t vllm-blackwell:latest --build-arg torch_cuda_arch_list="12.0 12.1"

Run the Container

podman run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-blackwell:latest \
  --model nvidia/Llama-3.3-70B-Instruct-FP4 \
  --max-model-len 10240 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8

Test with a Prompt
Use the OpenAI-compatible API endpoint as you did before.

Notes & Troubleshooting:

If you see broken output, try a different quantized model (e.g., RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic) or use a non-quantized model to verify the pipeline.
For FP8/FP4, not all models or quantization types are fully supported on Blackwell yet. Some kernel support is still experimental.
If you see warnings about CUDA archs, ensure TORCH_CUDA_ARCH_LIST is set to "12.0 12.1" during build.
If you want to use a nightly PyTorch, change the base image to a nightly PyTorch container or install nightly PyTorch in the Dockerfile.

Would you like a more advanced Dockerfile (with custom quantization or nightly PyTorch), or help with debugging broken output?

Sources:

Hint: Mention @RunLLM in the post for followups.

Topic		Replies	Views
vLLM on RTX5090: Working GPU setup with torch 2.9.0 cu128 NVIDIA GPU Support	15	1100	September 3, 2025
Can anyone help me? Why is this not working? It used 😭 NVIDIA GPU Support	1	573	May 8, 2025
Project: vLLM docker for running smoothly on RTX 5090 + WSL2 General	1	54	October 5, 2025
Docker explosion this morning after it worked fine for a long while NVIDIA GPU Support	6	306	May 6, 2025
Make install easier General	11	183	July 24, 2025

Support for RTX 6000 Blackwell 96GB card

Related topics