I know model weights and the KV cache are included. Is anything else included?
One more question: if the KV cache grows too large, what happens? For example, suppose 100 GB of GPU memory is set aside for weights and the KV cache, but at some point the KV cache needs more than that 100 GB. What happens then?
The gpu_memory_utilization parameter in vLLM controls the fraction of total GPU memory reserved for model weights, the KV cache, activation memory (intermediate tensors), and system overhead (e.g., the CUDA context, PyTorch, NCCL, CUDA graphs). It is not just model weights and KV cache; activation buffers and non-torch allocations also count against this budget. See the docs and API reference.
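As a concrete illustration, here is a minimal sketch of setting the parameter through the offline `LLM` entry point (the model name is just an example; `gpu_memory_utilization` defaults to 0.9 as far as I know):

```python
from vllm import LLM

# Reserve 90% of each GPU's memory for the vLLM engine. Within that
# budget, vLLM loads the weights, profiles peak activation memory with
# a dummy forward pass, and hands whatever remains to the KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    gpu_memory_utilization=0.9,                # fraction of total GPU memory
)
```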
If the total demand exceeds what the GPU actually has (e.g., you set gpu_memory_utilization so high that the sum of weights, activations, and KV cache needs more memory than is available), you will get a CUDA out-of-memory (OOM) error and the process will fail. vLLM does not automatically spill excess KV cache to CPU memory unless you explicitly configure offloading; otherwise, OOM occurs. See the troubleshooting guide and related issue discussions.
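If you do hit OOM, the usual mitigations are lowering gpu_memory_utilization, capping the context length to bound per-request KV cache, or reserving CPU swap space. A hedged sketch follows; all values are illustrative, not recommendations, and to my understanding `swap_space` provides CPU swap for preempted sequences rather than transparently spilling the cache:

```python
from vllm import LLM

# Illustrative tuning only; the right values depend on your model and GPU.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.85,  # leave headroom for non-torch allocations
    max_model_len=8192,           # cap context length to bound KV cache per request
    swap_space=8,                 # GiB of CPU swap per GPU for preempted sequences
)
```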
Would you like more detail on memory profiling or how to tune these parameters?