OOM Trying to run Gemma 4 31B NVFP4 on 2x16GB

Hello!

I am trying to run Gemma 4 31B NVFP4 on 2x 5070 Ti (2x 16GB VRAM). With the model sharded across both GPUs this should easily fit (and in fact it does run in llama.cpp with a 4-bit quant with plenty of VRAM left). However, I am getting OOM even with --max_num_seqs 1 and --max-model-len 1024. What am I doing wrong?

vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --max_num_batched_tokens 1024 --language-model-only --max_num_seqs 1 --max-model-len 1024 --kv-cache-dtype fp8 --gpu-memory-utilization 0.9 --tensor_parallel_size 2

WARNING 05-30 01:45:33 [argparse_utils.py:257] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in a future version.
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306]
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306] █ █ █▄ ▄█
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.21.0
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306] █▄█▀ █ █ █ █ model nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306]
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:240] non-default args: {‘model_tag’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘model’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘max_model_len’: 1024, ‘tensor_parallel_size’: 2, ‘gpu_memory_utilization’: 0.9, ‘kv_cache_dtype’: ‘fp8’, ‘language_model_only’: True, ‘max_num_batched_tokens’: 1024, ‘max_num_seqs’: 1}
(APIServer pid=17897) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=17897) INFO 05-30 01:45:33 [model.py:568] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=17897) INFO 05-30 01:45:33 [model.py:1697] Using max model len 1024
(APIServer pid=17897) INFO 05-30 01:45:34 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=17897) INFO 05-30 01:45:34 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=1024.
(APIServer pid=17897) INFO 05-30 01:45:34 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=17897) WARNING 05-30 01:45:34 [modelopt.py:1024] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=17897) INFO 05-30 01:45:34 [vllm.py:886] Asynchronous scheduling is enabled.
(APIServer pid=17897) INFO 05-30 01:45:34 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(APIServer pid=17897) WARNING 05-30 01:45:34 [cuda.py:237] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(APIServer pid=17897) INFO 05-30 01:45:34 [compilation.py:303] Enabled custom fusions: act_quant
(APIServer pid=17897) INFO 05-30 01:45:36 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=17968) INFO 05-30 01:45:41 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model=‘nvidia/Gemma-4-31B-IT-NVFP4’, speculative_config=None, tokenizer=‘nvidia/Gemma-4-31B-IT-NVFP4’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘ir_enable_torch_wrap’: True, ‘splitting_ops’: [‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::gdn_attention_core’, ‘vllm::gdn_attention_core_xpu’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, 
‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::deepseek_v4_attention’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_vision_items_per_batch’: 0, ‘encoder_cudagraph_max_frames_per_batch’: None, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [1024], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: True, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False, ‘fuse_act_padding’: False}, ‘max_cudagraph_capture_size’: 2, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: False, ‘static_all_moe_layers’: }, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’]), enable_flashinfer_autotune=False, moe_backend=‘auto’)
(EngineCore pid=17968) WARNING 05-30 01:45:41 [multiproc_executor.py:1029] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=17968) INFO 05-30 01:45:41 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.7.140 (local), world_size=2, local_world_size=2
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
INFO 05-30 01:45:46 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=18032) INFO 05-30 01:45:46 [parallel_state.py:1410] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:45699 backend=nccl
INFO 05-30 01:45:46 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=18031) INFO 05-30 01:45:46 [parallel_state.py:1410] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:45699 backend=nccl
(Worker pid=18031) INFO 05-30 01:45:47 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=18031) WARNING 05-30 01:45:47 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=18032) WARNING 05-30 01:45:47 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=18032) WARNING 05-30 01:45:47 [custom_all_reduce.py:164] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=18031) WARNING 05-30 01:45:47 [custom_all_reduce.py:164] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=18031) INFO 05-30 01:45:47 [parallel_state.py:1723] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=18031) INFO 05-30 01:45:47 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [gpu_model_runner.py:4857] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4…
(Worker_TP1 pid=18032) INFO 05-30 01:45:47 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [vllm.py:886] Asynchronous scheduling is enabled.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [compilation.py:303] Enabled custom fusions: act_quant
(Worker_TP1 pid=18032) INFO 05-30 01:45:47 [cuda.py:312] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [cuda.py:312] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [init.py:687] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP1 pid=18032) INFO 05-30 01:45:47 [cuda.py:312] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [cuda.py:312] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=18031) ERROR 05-30 01:45:48 [gpu_model_runner.py:4957] Failed to load model - not enough GPU memory. Try lowering --gpu-memory-utilization to free memory for weights, increasing --tensor-parallel-size, or using --quantization. See Conserving Memory - vLLM for more tips. (original error: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 15.47 GiB of which 62.25 MiB is free. Process 5949 has 56.53 MiB memory in use. Process 6282 has 216.98 MiB memory in use. Including non-PyTorch memory, this process has 14.55 GiB memory in use. Of the allocated memory 14.01 GiB is allocated by PyTorch, and 104.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (Redirecting…))

I am running 0.21.0.

Thanks

Your issue is that vLLM’s memory usage for Gemma 4 31B NVFP4 is much higher than expected, even with aggressive settings. This is a known limitation: vLLM currently loads the full-precision weights into memory before quantization, so the initial allocation can exceed your 16GB per GPU and cause an OOM even if the quantized weights would fit. llama.cpp, by contrast, streams quantized weights directly, which makes it more memory-efficient on low-VRAM setups. As of now there is no official vLLM option to avoid this initial allocation for ModelOpt/NVFP4 models (source, source).
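For a rough sanity check of the numbers in this thread, here is a back-of-the-envelope estimate of the per-GPU weight footprint. It is only a sketch: it assumes exactly 31e9 parameters, assumes every parameter is stored at the stated bit width, and ignores quantization scales, any layers left unquantized, the KV cache, activations, and CUDA graph memory. The function name `weight_gib_per_gpu` is made up for illustration; the 14.01 GiB figure is the "allocated by PyTorch" value from the error log above.

```python
GIB = 1024 ** 3


def weight_gib_per_gpu(num_params: float, bits_per_param: int, tensor_parallel_size: int) -> float:
    """Idealized weight memory per GPU in GiB, assuming an even split across TP ranks."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / tensor_parallel_size / GIB


# Assumption: 31e9 parameters, all stored at the given precision.
fp4_per_gpu = weight_gib_per_gpu(31e9, bits_per_param=4, tensor_parallel_size=2)
bf16_per_gpu = weight_gib_per_gpu(31e9, bits_per_param=16, tensor_parallel_size=2)

# Value reported in the OOM message ("14.01 GiB is allocated by PyTorch").
observed_gib = 14.01

print(f"idealized 4-bit weights per GPU: {fp4_per_gpu:.2f} GiB")
print(f"idealized bf16 weights per GPU:  {bf16_per_gpu:.2f} GiB")
print(f"allocated by PyTorch at failure: {observed_gib:.2f} GiB")
```

Under these assumptions the idealized 4-bit footprint is roughly 7.2 GiB per GPU, well below the 14.01 GiB PyTorch had allocated when loading failed. The estimate does not say what the extra memory is used for; it only shows that the load had already gone well past the pure 4-bit figure.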

You can try:

  • Lowering --gpu-memory-utilization further (e.g., 0.7 or 0.6), but this may not help if the initial allocation is the problem.
  • Increasing tensor_parallel_size (requires more GPUs).
  • Waiting for upstream changes to support true low-memory quantized loading for ModelOpt/NVFP4 in vLLM.

Would you like a step-by-step breakdown of the memory allocation process or more detail on possible workarounds?

Hint: Mention @RunLLM in the post for followups.

@RunLLM But I’m loading a pre-quantized checkpoint?

Even when loading a pre-quantized NVFP4 checkpoint, vLLM still loads the full-precision weights into GPU memory before quantization, which causes a large memory spike and can lead to OOM on 16GB GPUs (source, source). This is a documented limitation: vLLM does not yet support true streaming or incremental loading of quantized weights for ModelOpt/NVFP4, so the entire model is loaded at high precision first and then quantized in memory.

llama.cpp and some other frameworks avoid this by loading only the quantized weights, but vLLM’s current implementation does not. There is no config flag to change this behavior as of now.

Would you like more detail on the technical reasons or possible future improvements?

Try CPU offloading: offload part of the model to system RAM, wait for warm-up to finish, and you should be good to go :hugs:
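A minimal sketch of what that could look like, written as a small Python helper that assembles the command line from the original post plus vLLM's `--cpu-offload-gb` option. The helper name `build_serve_command` and the 4 GiB value are placeholders for illustration, not recommendations, and nothing in this thread confirms that CPU offloading works with NVFP4 checkpoints and tensor parallelism on 0.21.0, so treat it as something to try rather than a known fix.

```python
import shlex


def build_serve_command(cpu_offload_gb: float) -> list[str]:
    """Assemble a `vllm serve` argv with CPU offloading added (hypothetical helper)."""
    return [
        # Model passed positionally, as the deprecation warning in the log asks.
        "vllm", "serve", "nvidia/Gemma-4-31B-IT-NVFP4",
        # Settings carried over from the original command.
        "--tensor-parallel-size", "2",
        "--max-model-len", "1024",
        "--max-num-seqs", "1",
        "--max-num-batched-tokens", "1024",
        "--kv-cache-dtype", "fp8",
        "--gpu-memory-utilization", "0.9",
        "--language-model-only",
        # GiB of weights to keep in system RAM per GPU; the value is an arbitrary starting point.
        "--cpu-offload-gb", str(cpu_offload_gb),
    ]


argv = build_serve_command(cpu_offload_gb=4)
print(shlex.join(argv))
```

The printed line can be pasted into a shell. Offloaded weights are fetched from system RAM during the forward pass, so expect lower throughput in exchange for the extra headroom.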