@RunLLM
vLLM allocates a large KV cache on startup based on gpu_memory_utilization, so nvidia-smi reports very high GPU memory usage even when the server is idle. I know the reserved KV-cache size can be derived from the cache configuration (e.g., num_gpu_blocks × bytes_per_block), but how can I observe the actively used portion of that memory at runtime?
vLLM exposes the actively used portion of the KV cache via the kv_cache_usage metric, which reports the fraction of allocated KV-cache blocks currently in use (a 0–1 gauge, where 1 means 100% usage). It is exported to Prometheus as vllm:kv_cache_usage_perc and is also logged periodically by the LoggingStatLogger. In other words, it tells you how much of the preallocated KV cache is actually occupied by running requests, not the total reserved size that nvidia-smi shows.
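To connect this back to the reserved size you mentioned: the actively used bytes are roughly the usage fraction times the reserved KV-cache size (num_gpu_blocks × bytes_per_block). The sketch below is illustrative only; the block count, model shape (layers, KV heads, head dim), and dtype are placeholder assumptions, so substitute the values vLLM prints at startup for your model.

```python
# Illustrative back-of-the-envelope only; every number below is a placeholder.
num_gpu_blocks = 10000                # from the "# GPU blocks: ..." line in vLLM's startup logs
block_size = 16                       # tokens per KV-cache block (vLLM's default)
num_layers, num_kv_heads, head_dim = 32, 8, 128   # example Llama-like model shape
dtype_bytes = 2                       # fp16/bf16 KV cache

# K and V tensors for each token in the block, across all layers:
bytes_per_block = 2 * dtype_bytes * block_size * num_layers * num_kv_heads * head_dim

kv_cache_usage = 0.12                 # current value of the vllm:kv_cache_usage_perc gauge (0-1)

reserved_bytes = num_gpu_blocks * bytes_per_block
used_bytes = kv_cache_usage * reserved_bytes
print(f"reserved KV cache: {reserved_bytes / 2**30:.2f} GiB")
print(f"actively in use:   {used_bytes / 2**30:.2f} GiB")
```

The reserved figure is what nvidia-smi shows permanently; only the "actively in use" portion changes as requests come and go.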
You can monitor this metric by scraping the /metrics endpoint if PrometheusStatLogger is enabled, or by checking the log output if LoggingStatLogger is active. For more details, see Frontend Stats Collection and Metrics Publishing - Logging. Would you like more detail on how to interpret or use this metric?
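In the meantime, here is a minimal sketch of how you could poll the gauge ad hoc. It assumes the OpenAI-compatible server is listening on http://localhost:8000 and a recent (V1) vLLM that exports vllm:kv_cache_usage_perc; the fallback to vllm:gpu_cache_usage_perc for older versions is an assumption about your setup.

```python
# Minimal sketch: poll the Prometheus text endpoint and extract the KV-cache usage gauge.
# Assumes the server is reachable at http://localhost:8000 (adjust as needed).
from typing import Optional
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def kv_cache_usage(metric: str = "vllm:kv_cache_usage_perc") -> Optional[float]:
    """Return the current KV-cache usage fraction (0-1), or None if the metric is absent."""
    with urllib.request.urlopen(METRICS_URL) as resp:
        text = resp.read().decode()
    for line in text.splitlines():
        # Sample lines look like:
        #   vllm:kv_cache_usage_perc{model_name="..."} 0.1234
        # "# HELP" / "# TYPE" comment lines start with '#' and never match this prefix check.
        if line.startswith(metric):
            return float(line.rsplit(" ", 1)[-1])
    return None

if __name__ == "__main__":
    usage = kv_cache_usage()
    if usage is None:
        # Older vLLM versions expose the gauge under a different name.
        usage = kv_cache_usage("vllm:gpu_cache_usage_perc")
    print(f"KV cache usage: {usage:.1%}" if usage is not None else "metric not found")
```

In a real deployment you would normally let Prometheus scrape the endpoint and chart the gauge directly (e.g., in Grafana) rather than polling it by hand like this.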