@RunLLM
Yes, please provide such a step-by-step build guide and a sample Dockerfile.
While waiting for approval of my post, I got a bit further. I followed a recipe from NVIDIA here. This allowed me to load nvidia/Llama-3.3-70B-Instruct-FP4 onto my GPU, and it consumes around 65 GB of VRAM, as can be seen here:
nvidia-smi
Wed Oct 8 07:23:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8              3W /  300W |   65852MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           91857      C   VLLM::Worker                          65842MiB |
+-----------------------------------------------------------------------------------------+
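For completeness, the server is launched roughly like this (reconstructed from the non-default args that show up in the build log further down; the image tag, the volume mount and the host port mapping are specific to my setup, and the compilation-config JSON from the recipe is shortened here, the full version is visible in the log):

# /path/to/models is a placeholder for my local model directory
docker run --rm --gpus all -p 8003:8000 \
  -v /path/to/models:/models \
  vllm/vllm-openai:v0.11.0 \
  --model nvidia/Llama-3.3-70B-Instruct-FP4 \
  --tokenizer /models/nvidia/Llama-3.3-70B-Instruct-FP4 \
  --trust-remote-code \
  --max-model-len 10240 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 512 \
  --async-scheduling \
  --compilation-config '{"custom_ops": ["+quant_fp8", "+rms_norm"], "pass_config": {"enable_attn_fusion": true, "enable_noop": true, "enable_fi_allreduce_fusion": true}}'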
However, the generated text is broken when I send a very simple prompt through curl:
curl http://0.0.0.0:8003/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "nvidia/Llama-3.3-70B-Instruct-FP4",
  "messages": [{"role": "user", "content": "Explain the main features of the Blackwell GPU architecture."}],
  "max_tokens": 128
}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1309  100  1121  100   188    168     28  0:00:06  0:00:06 --:--:--   252
{
  "id": "chatcmpl-33a760382a4246c49b4a4a86254b671d",
  "object": "chat.completion",
  "created": 1759861691,
  "model": "nvidia/Llama-3.3-70B-Instruct-FP4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The the original\\nThe the management of the whole\\nThe the mill\\nThe the end (200\\\\\\\\\\nThe , a second\\\\\\\\\\nIt\\\\\\\\\\nThe the parent\\nThe a minute by\\nThe and the national security and exchange.\\nThe, and Maldives\\nThe\\\\\\\\\\nThe . the actual\\nassistant\\n\\nThe\\\\\\\\\\nassistant\\n\\n1. Schmidt et . . (West\\nUnderstand up\\nWhen to\\n\\nThe a must be\\\\\\\\\\nassistant\\n\\nThe\\\\\\\\\\n##The\\\\\\\\\\nassistant\\n\\nWe have to\\n\\nAs a teacher for, , a year the World Health\\nThe the\\n The legal",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 47,
    "total_tokens": 175,
    "completion_tokens": 128,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
The build log contains some interesting hints, though most of it is not understandable to me. I paste some parts here for anybody who might be interested:
INFO 10-07 12:03:09 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1) INFO 10-07 12:03:11 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1) INFO 10-07 12:03:11 [utils.py:233] non-default args: {'model': 'nvidia/Llama-3.3-70B-Instruct-FP4', 'tokenizer': '/models/nvidia/Llama-3.3-70B-Instruct-FP4', 'trust_remote_code': True, 'max_model_len': 10240, 'gpu_memory_utilization': 0.85, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': False, 'max_num_batched_tokens': 8192, 'max_num_seqs': 512, 'async_scheduling': True, 'compilation_config': {"level":null,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["+quant_fp8","+rms_norm"],"splitting_ops":[],"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,0],"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fi_allreduce_fusion":true},"max_capture_size":null,"local_cache_dir":null}}
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 10-07 12:03:12 [model.py:547] Resolved architecture: LlamaForCausalLM
(APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1) INFO 10-07 12:03:12 [model.py:1510] Using max model len 10240
(APIServer pid=1) INFO 10-07 12:03:13 [cache.py:180] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 10-07 12:03:14 [arg_utils.py:1293] Defaulting to mp-based distributed executor backend for async scheduling.
(APIServer pid=1) INFO 10-07 12:03:14 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) WARNING 10-07 12:03:14 [modelopt.py:626] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) WARNING 10-07 12:03:14 [compilation.py:598] Using piecewise compilation with empty splitting_ops and use_inductor_graph_partition=False.
INFO 10-07 12:03:16 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=101) INFO 10-07 12:03:17 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=101) INFO 10-07 12:03:17 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='nvidia/Llama-3.3-70B-Instruct-FP4', speculative_config=None, tokenizer='/models/nvidia/Llama-3.3-70B-Instruct-FP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10240, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=nvidia/Llama-3.3-70B-Instruct-FP4, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["+quant_fp8","+rms_norm"],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,0],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fi_allreduce_fusion":true},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=101) WARNING 10-07 12:03:17 [multiproc_executor.py:720] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=101) INFO 10-07 12:03:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_bd917c15'), local_subscribe_addr='ipc:///tmp/15d6c171-f940-4d86-acef-44e17dc1c950', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-07 12:03:18 [__init__.py:216] Automatically detected platform cuda.
W1007 12:03:20.625000 155 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1007 12:03:20.625000 155 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 10-07 12:03:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_b2aa7488'), local_subscribe_addr='ipc:///tmp/3767a4fe-9e5f-4d93-9e2c-0757b9e78e30', remote_subscribe_addr=None, remote_addr_ipv6=False)
...
There are more warnings in the log. If someone would like the whole log for whatever reason, I can paste it, but it's pretty long...
So I tried all this with the vLLM Docker image, version 0.11.0.
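One experiment I still want to try with this 0.11.0 image, given the 12:03:13 warning above that the fp8 KV cache may cause an accuracy drop without a proper scaling factor, is relaunching without --kv-cache-dtype fp8 and sending the same prompt again (just a guess on my side, not a confirmed fix):

# Same launch as above, but with the recipe flags trimmed to the essentials and the fp8 KV cache removed
docker run --rm --gpus all -p 8003:8000 \
  -v /path/to/models:/models \
  vllm/vllm-openai:v0.11.0 \
  --model nvidia/Llama-3.3-70B-Instruct-FP4 \
  --tokenizer /models/nvidia/Llama-3.3-70B-Instruct-FP4 \
  --trust-remote-code \
  --max-model-len 10240 \
  --gpu-memory-utilization 0.85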
If I get the Dockerfile, I will try a full compile with a nightly version and see whether it produces better answers to a prompt. First I have to read through all the provided links.
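In case it helps, this is the kind of Dockerfile I have in mind for the nightly build. It is only a rough sketch based on the vLLM build-from-source docs; the base image tag, the TORCH_CUDA_ARCH_LIST value for the RTX PRO 6000 (compute capability 12.0) and the build steps are my assumptions, and I would happily replace them with an official recipe:

# Rough sketch only: base image tag, arch list and build steps are assumptions, not a tested recipe.
FROM nvidia/cuda:12.8.1-devel-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential python3 python3-pip python3-dev git ccache && \
    rm -rf /var/lib/apt/lists/*

# Compile kernels only for the Blackwell workstation arch (RTX PRO 6000 = compute capability 12.0),
# which should also take care of the TORCH_CUDA_ARCH_LIST warning seen in the log above.
ENV TORCH_CUDA_ARCH_LIST="12.0"
# Limit parallel compile jobs so the build does not exhaust RAM.
ENV MAX_JOBS=8

# Build vLLM from the current main branch (the "nightly" state).
RUN git clone https://github.com/vllm-project/vllm.git /opt/vllm
WORKDIR /opt/vllm
RUN pip3 install --upgrade pip && pip3 install .

EXPOSE 8000
ENTRYPOINT ["vllm", "serve"]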