Hello!
I am trying to run Gemma 4 31B NVFP4 on 2x 5070 Ti (2x 16 GB VRAM) with vLLM. Sharded across both GPUs, the model should fit easily (and it does run in llama.cpp with a 4-bit quant with plenty of VRAM left). However, I get an OOM while the model is loading, even with max_num_seqs=1 and max-model-len 1024. What am I doing wrong?
vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --max_num_batched_tokens 1024 --language-model-only --max_num_seqs 1 --max-model-len 1024 --kv-cache-dtype fp8 --gpu-memory-utilization 0.9 --tensor_parallel_size 2
WARNING 05-30 01:45:33 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306]
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306] █ █ █▄ ▄█
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.21.0
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306] █▄█▀ █ █ █ █ model nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:306]
(APIServer pid=17897) INFO 05-30 01:45:33 [utils.py:240] non-default args: {‘model_tag’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘model’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘max_model_len’: 1024, ‘tensor_parallel_size’: 2, ‘gpu_memory_utilization’: 0.9, ‘kv_cache_dtype’: ‘fp8’, ‘language_model_only’: True, ‘max_num_batched_tokens’: 1024, ‘max_num_seqs’: 1}
(APIServer pid=17897) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=17897) INFO 05-30 01:45:33 [model.py:568] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=17897) INFO 05-30 01:45:33 [model.py:1697] Using max model len 1024
(APIServer pid=17897) INFO 05-30 01:45:34 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=17897) INFO 05-30 01:45:34 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=1024.
(APIServer pid=17897) INFO 05-30 01:45:34 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=17897) WARNING 05-30 01:45:34 [modelopt.py:1024] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=17897) INFO 05-30 01:45:34 [vllm.py:886] Asynchronous scheduling is enabled.
(APIServer pid=17897) INFO 05-30 01:45:34 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(APIServer pid=17897) WARNING 05-30 01:45:34 [cuda.py:237] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(APIServer pid=17897) INFO 05-30 01:45:34 [compilation.py:303] Enabled custom fusions: act_quant
(APIServer pid=17897) INFO 05-30 01:45:36 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=17968) INFO 05-30 01:45:41 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model=‘nvidia/Gemma-4-31B-IT-NVFP4’, speculative_config=None, tokenizer=‘nvidia/Gemma-4-31B-IT-NVFP4’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘ir_enable_torch_wrap’: True, ‘splitting_ops’: [‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::gdn_attention_core’, ‘vllm::gdn_attention_core_xpu’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, 
‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::deepseek_v4_attention’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_vision_items_per_batch’: 0, ‘encoder_cudagraph_max_frames_per_batch’: None, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [1024], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: True, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False, ‘fuse_act_padding’: False}, ‘max_cudagraph_capture_size’: 2, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: False, ‘static_all_moe_layers’: }, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’]), enable_flashinfer_autotune=False, moe_backend=‘auto’)
(EngineCore pid=17968) WARNING 05-30 01:45:41 [multiproc_executor.py:1029] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=17968) INFO 05-30 01:45:41 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.7.140 (local), world_size=2, local_world_size=2
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
INFO 05-30 01:45:46 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=18032) INFO 05-30 01:45:46 [parallel_state.py:1410] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:45699 backend=nccl
INFO 05-30 01:45:46 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=18031) INFO 05-30 01:45:46 [parallel_state.py:1410] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:45699 backend=nccl
(Worker pid=18031) INFO 05-30 01:45:47 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=18031) WARNING 05-30 01:45:47 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=18032) WARNING 05-30 01:45:47 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=18032) WARNING 05-30 01:45:47 [custom_all_reduce.py:164] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=18031) WARNING 05-30 01:45:47 [custom_all_reduce.py:164] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=18031) INFO 05-30 01:45:47 [parallel_state.py:1723] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=18031) INFO 05-30 01:45:47 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [gpu_model_runner.py:4857] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4…
(Worker_TP1 pid=18032) INFO 05-30 01:45:47 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [vllm.py:886] Asynchronous scheduling is enabled.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [compilation.py:303] Enabled custom fusions: act_quant
(Worker_TP1 pid=18032) INFO 05-30 01:45:47 [cuda.py:312] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [cuda.py:312] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [__init__.py:687] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP1 pid=18032) INFO 05-30 01:45:47 [cuda.py:312] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=18031) INFO 05-30 01:45:47 [cuda.py:312] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=18031) ERROR 05-30 01:45:48 [gpu_model_runner.py:4957] Failed to load model - not enough GPU memory. Try lowering --gpu-memory-utilization to free memory for weights, increasing --tensor-parallel-size, or using --quantization. See Conserving Memory - vLLM for more tips. (original error: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 15.47 GiB of which 62.25 MiB is free. Process 5949 has 56.53 MiB memory in use. Process 6282 has 216.98 MiB memory in use. Including non-PyTorch memory, this process has 14.55 GiB memory in use. Of the allocated memory 14.01 GiB is allocated by PyTorch, and 104.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (Redirecting…))
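To make the numbers in that error easier to read, here is a small sketch that pulls the figures out of the OOM message and compares them with the memory budget implied by --gpu-memory-utilization 0.9. It assumes the flag is a simple fraction of the GPU's total capacity; the `gib` helper is just for this illustration and is not part of vLLM.

```python
import re

# Excerpt of the OOM message from the log above (GPU 0).
msg = (
    "Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 15.47 GiB "
    "of which 62.25 MiB is free. Including non-PyTorch memory, this process "
    "has 14.55 GiB memory in use. Of the allocated memory 14.01 GiB is "
    "allocated by PyTorch"
)


def gib(pattern: str) -> float:
    """Find '<number> GiB|MiB' after the given phrase and return it in GiB."""
    m = re.search(pattern + r" ([\d.]+) (GiB|MiB)", msg)
    value, unit = float(m.group(1)), m.group(2)
    return value if unit == "GiB" else value / 1024


total = gib(r"total capacity of")
free = gib(r"of which")
in_use = gib(r"this process has")
torch_alloc = gib(r"Of the allocated memory")

# Assumption: the budget is simply utilization * total capacity.
gpu_memory_utilization = 0.9
budget = gpu_memory_utilization * total

print(f"total capacity      : {total:6.2f} GiB")
print(f"budget at 0.9       : {budget:6.2f} GiB")
print(f"process in use      : {in_use:6.2f} GiB")
print(f"allocated by PyTorch: {torch_alloc:6.2f} GiB")
print(f"free                : {free:6.2f} GiB")
```

Under that assumption, the process on GPU 0 is already past the 0.9 budget while it is still loading weights, before any KV cache is allocated, so max_num_seqs and max-model-len cannot be what pushes it over.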
I am running vLLM 0.21.0.
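For context on why I expected this to fit, here is a back-of-the-envelope estimate of the weight memory per GPU. It is only a sketch under loud assumptions: the parameter count is taken from the "31B" in the model name, every weight is assumed to be stored as NVFP4 (4-bit values plus one 8-bit scale per 16-element block, so about 4.5 bits per weight), and tensor parallelism is assumed to split the weights evenly. I have not checked which of these assumptions fails for this checkpoint.

```python
GIB = 1024**3


def weight_gib_per_gpu(n_params: float, bits_per_weight: float, tp_size: int) -> float:
    """Weight memory per GPU in GiB, assuming an even split across tp_size GPUs."""
    return n_params * bits_per_weight / 8 / tp_size / GIB


n_params = 31e9  # assumption: read off the "31B" in the model name
tp_size = 2      # --tensor_parallel_size 2

# Assumption: all weights are NVFP4, i.e. roughly 4.5 bits per weight.
nvfp4 = weight_gib_per_gpu(n_params, 4.5, tp_size)
# For comparison: everything kept in bfloat16 (16 bits per weight).
bf16 = weight_gib_per_gpu(n_params, 16, tp_size)

# From the OOM message: PyTorch had 14.01 GiB allocated on GPU 0 when loading failed.
observed_gib = 14.01
implied_bits = observed_gib * GIB * 8 * tp_size / n_params

print(f"all-NVFP4 estimate per GPU : {nvfp4:5.2f} GiB")
print(f"all-bf16 estimate per GPU  : {bf16:5.2f} GiB")
print(f"implied average at failure : {implied_bits:4.2f} bits/weight (lower bound)")
```

The all-NVFP4 estimate is well under 16 GB per card, yet the load fails with about 14 GiB already allocated. That would be consistent with part of the checkpoint being kept at higher precision than 4 bits, or with some tensors not being split across the two GPUs, but that is a guess on my part.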
Thanks