What does gpu-memory-utilization include?

How should we set gpu-memory-utilization? The default value seems to be 0.9. If we have 2 H100 GPUs, does that mean 90% of the 160GB of VRAM is for the KV cache, weights, and activations? Is the remaining 10% for overhead?

The gpu_memory_utilization parameter sets the fraction of each GPU’s memory that vLLM will use for model weights, activations, and KV cache. The default is 0.9, meaning 90% of each GPU’s VRAM is available for these purposes, and the remaining 10% is reserved for overhead (e.g., CUDA graphs, kernels, fragmentation, and temporary allocations). For 2 H100 GPUs (80GB each, 160GB total), vLLM will use up to 72GB per GPU (0.9 × 80GB) for its allocations; this is per-GPU, not pooled across both GPUs. The 10% buffer helps prevent out-of-memory errors due to unpredictable runtime allocations and fragmentation.
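The arithmetic above can be sketched in a few lines of Python (figures taken from the question: 2× H100 with 80GB each, and the 0.9 default):

```python
# Per-GPU memory budget under gpu_memory_utilization.
# Figures assumed from the question: 2x H100 with 80 GB VRAM each.
num_gpus = 2
vram_per_gpu_gb = 80
gpu_memory_utilization = 0.9  # vLLM default

usable_per_gpu_gb = vram_per_gpu_gb * gpu_memory_utilization
overhead_per_gpu_gb = vram_per_gpu_gb - usable_per_gpu_gb

# The budget applies per GPU, not pooled across the pair:
total_usable_gb = usable_per_gpu_gb * num_gpus

print(f"usable per GPU:   {usable_per_gpu_gb:.0f} GB")    # 72 GB
print(f"overhead per GPU: {overhead_per_gpu_gb:.0f} GB")  # 8 GB
print(f"usable total:     {total_usable_gb:.0f} GB")      # 144 GB
```

Note that the 144GB total is only meaningful with tensor parallelism across both GPUs; each GPU still enforces its own 72GB cap.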

You should not set gpu_memory_utilization to 1.0, as this can cause OOM errors; values between 0.9 and 0.95 are generally safe and recommended. The parameter applies per vLLM instance, so if you run multiple vLLM instances on the same GPU, each instance uses its own fraction of the available memory at the time of allocation. For more details, see the official documentation and discussion: docs.vllm.ai - engine_args, docs.vllm.ai - optimization, and vllm-project/vllm#19805.
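For reference, the parameter can be set when launching the server; a minimal sketch using vLLM's `serve` CLI (the model name is a placeholder, not from the thread):

```shell
# Serve across both H100s with 2-way tensor parallelism, keeping ~5% headroom.
# Model name is a placeholder; substitute your own deployment.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
```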



What is a recommended allocation for activations, as a percentage of GPU VRAM?
