About monitoring KV cache memory usage

Hi,
I can set the GPU memory budget for the KV cache through the initialization parameter gpu_memory_utilization.
But how can I observe whether this portion of GPU memory is actually being used, in other words, how much of it is truly occupied? How can I monitor the usage of this part of GPU memory?

You can monitor GPU memory usage for the KV cache in vLLM by checking the logs during initialization, which report memory usage by component: how much is reserved for the KV cache and how much is taken by model weights and activations. For real-time monitoring, vLLM does not provide a built-in API that directly reports live KV cache usage, but you can observe overall GPU memory usage with tools like nvidia-smi and infer KV cache utilization from the difference between usage after model loading and usage during inference. The logs contain lines like: “model weights take X GiB; … the rest of the memory reserved for KV Cache is Y GiB”. You can compare that figure against the total GPU memory usage reported by system tools or by vLLM's own logs at runtime.
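As a rough illustration, a log line of that shape can be parsed to recover the reported KV cache reservation. The exact log text below is an assumed example, not verbatim vLLM output; adjust the regex to match what your vLLM version actually prints:

```python
import re

# Assumed example of a vLLM initialization log line; the exact wording
# varies by version, so adapt the pattern to your own logs.
log_line = ("model weights take 13.50GiB; "
            "the rest of the memory reserved for KV Cache is 8.05GiB")

def parse_kv_cache_gib(line: str):
    """Extract the KV cache reservation (in GiB) from an init log line."""
    match = re.search(r"reserved for KV Cache is ([\d.]+)\s*GiB", line)
    return float(match.group(1)) if match else None

print(parse_kv_cache_gib(log_line))  # 8.05
```

Comparing this parsed value against the memory reported by nvidia-smi during inference gives you a rough picture of how much of the reserved budget is in active use.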

For more granular monitoring, you may need to instrument the code or use external GPU profiling tools, since vLLM does not currently expose a direct API for live KV cache usage statistics. See this issue discussion for details and log examples.
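To reason about what the reserved budget means in practice, you can estimate how many tokens the KV cache can hold from the model's shape. Every parameter below is an illustrative assumption for a hypothetical model, not a value vLLM reports:

```python
# Illustrative back-of-the-envelope estimate of KV cache token capacity.
# All model parameters here are assumptions, not values reported by vLLM.
kv_cache_gib = 8.0      # KV cache budget taken from the init log
num_layers = 32         # transformer layers
num_kv_heads = 8        # KV heads (e.g. with grouped-query attention)
head_dim = 128          # dimension per head
bytes_per_elem = 2      # fp16/bf16

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
capacity_tokens = int(kv_cache_gib * 1024**3 // bytes_per_token)

print(bytes_per_token)   # 131072 bytes of KV cache per token
print(capacity_tokens)   # 65536 tokens fit in the 8 GiB budget
```

Dividing the number of tokens currently resident in running sequences by this capacity gives a rough utilization figure, similar to what you would infer from external memory monitoring.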

Would you like more detail on how to interpret these logs or set up external monitoring?
