@RunLLM
vLLM allocates a large KV cache on startup based on gpu_memory_utilization, so nvidia-smi reports very high GPU memory usage even when the server is idle. I know the reserved KV-cache size can be derived from the cache configuration (e.g., num_gpu_blocks × bytes_per_block), but how can I observe the actively used portion of that memory at runtime?
vLLM exposes the actively used portion of the KV cache via the kv_cache_usage metric, which reports the fraction of allocated KV-cache blocks currently in use (a 0–1 gauge, where 1 means 100% usage). It is exported to Prometheus as vllm:kv_cache_usage_perc and is also logged periodically by the LoggingStatLogger. In other words, it tells you how much of the preallocated KV cache is actually occupied by running requests, not the total reserved size that nvidia-smi shows.
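To connect this back to the reserved size you mentioned: the actively used bytes are roughly the usage fraction times the reserved KV-cache size (num_gpu_blocks × bytes_per_block). The sketch below is illustrative only; the block count, model shape (layers, KV heads, head dim), and dtype are placeholder assumptions, so substitute the values vLLM prints at startup for your model.

```python
# Illustrative back-of-the-envelope only; every number below is a placeholder.
num_gpu_blocks = 10000                # from the "# GPU blocks: ..." line in vLLM's startup logs
block_size = 16                       # tokens per KV-cache block (vLLM's default)
num_layers, num_kv_heads, head_dim = 32, 8, 128   # example Llama-like model shape
dtype_bytes = 2                       # fp16/bf16 KV cache

# K and V tensors for each token in the block, across all layers:
bytes_per_block = 2 * dtype_bytes * block_size * num_layers * num_kv_heads * head_dim

kv_cache_usage = 0.12                 # current value of the vllm:kv_cache_usage_perc gauge (0-1)

reserved_bytes = num_gpu_blocks * bytes_per_block
used_bytes = kv_cache_usage * reserved_bytes
print(f"reserved KV cache: {reserved_bytes / 2**30:.2f} GiB")
print(f"actively in use:   {used_bytes / 2**30:.2f} GiB")
```

The reserved figure is what nvidia-smi shows permanently; only the "actively in use" portion changes as requests come and go.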
You can monitor this metric by scraping the /metrics endpoint if PrometheusStatLogger is enabled, or by checking the log output if LoggingStatLogger is active. For more details, see Frontend Stats Collection and Metrics Publishing - Logging. Would you like more detail on how to interpret or use this metric?
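In the meantime, here is a minimal sketch of how you could poll the gauge ad hoc. It assumes the OpenAI-compatible server is listening on http://localhost:8000 and a recent (V1) vLLM that exports vllm:kv_cache_usage_perc; the fallback to vllm:gpu_cache_usage_perc for older versions is an assumption about your setup.

```python
# Minimal sketch: poll the Prometheus text endpoint and extract the KV-cache usage gauge.
# Assumes the server is reachable at http://localhost:8000 (adjust as needed).
from typing import Optional
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def kv_cache_usage(metric: str = "vllm:kv_cache_usage_perc") -> Optional[float]:
    """Return the current KV-cache usage fraction (0-1), or None if the metric is absent."""
    with urllib.request.urlopen(METRICS_URL) as resp:
        text = resp.read().decode()
    for line in text.splitlines():
        # Sample lines look like:
        #   vllm:kv_cache_usage_perc{model_name="..."} 0.1234
        # "# HELP" / "# TYPE" comment lines start with '#' and never match this prefix check.
        if line.startswith(metric):
            return float(line.rsplit(" ", 1)[-1])
    return None

if __name__ == "__main__":
    usage = kv_cache_usage()
    if usage is None:
        # Older vLLM versions expose the gauge under a different name.
        usage = kv_cache_usage("vllm:gpu_cache_usage_perc")
    print(f"KV cache usage: {usage:.1%}" if usage is not None else "metric not found")
```

In a real deployment you would normally let Prometheus scrape the endpoint and chart the gauge directly (e.g., in Grafana) rather than polling it by hand like this.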