What does gpu-memory-utilization include?

How should we set gpu-memory-utilization? The default value seems to be 0.9. If we have 2 H100 GPUs, does that mean 90% of the 160GB of VRAM is for the KV cache, weights, and activations? Is the remaining 10% for overhead?

The gpu_memory_utilization parameter sets the fraction of each GPU’s memory that vLLM will use for model weights, activations, and KV cache. The default is 0.9, meaning 90% of each GPU’s VRAM is available for these purposes, and the remaining 10% is reserved for overhead (e.g., CUDA graphs, kernels, fragmentation, and temporary allocations). For 2 H100 GPUs (80GB each, 160GB total), vLLM will use up to 72GB per GPU (0.9 × 80GB) for its allocations; this is per-GPU, not pooled across both GPUs. The 10% buffer helps prevent out-of-memory errors due to unpredictable runtime allocations and fragmentation.
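The arithmetic above can be sketched in a few lines of Python (figures taken from the question: 2× H100 with 80GB each, and the 0.9 default):

```python
# Per-GPU memory budget under gpu_memory_utilization.
# Figures assumed from the question: 2x H100 with 80 GB VRAM each.
num_gpus = 2
vram_per_gpu_gb = 80
gpu_memory_utilization = 0.9  # vLLM default

usable_per_gpu_gb = vram_per_gpu_gb * gpu_memory_utilization
overhead_per_gpu_gb = vram_per_gpu_gb - usable_per_gpu_gb

# The budget applies per GPU, not pooled across the pair:
total_usable_gb = usable_per_gpu_gb * num_gpus

print(f"usable per GPU:   {usable_per_gpu_gb:.0f} GB")    # 72 GB
print(f"overhead per GPU: {overhead_per_gpu_gb:.0f} GB")  # 8 GB
print(f"usable total:     {total_usable_gb:.0f} GB")      # 144 GB
```

Note that the 144GB total is only meaningful with tensor parallelism across both GPUs; each GPU still enforces its own 72GB cap.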

You should not set gpu_memory_utilization to 1.0, as this can cause OOM errors; values between 0.9 and 0.95 are generally safe and recommended. The parameter applies per vLLM instance, so if you run multiple vLLM instances on the same GPU, each instance uses its own fraction of the available memory at the time of allocation. For more details, see the official documentation and discussion: docs.vllm.ai - engine_args, docs.vllm.ai - optimization, and vllm-project/vllm#19805.
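For reference, the parameter can be set when launching the server; a minimal sketch using vLLM's `serve` CLI (the model name is a placeholder, not from the thread):

```shell
# Serve across both H100s with 2-way tensor parallelism, keeping ~5% headroom.
# Model name is a placeholder; substitute your own deployment.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
```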



What is a recommended allocation for activations, as a percentage of GPU VRAM?
