The metrics process_virtual_memory_bytes and process_resident_memory_bytes report CPU memory usage, not GPU memory. process_virtual_memory_bytes is the total virtual address space allocated by the vLLM server process; process_resident_memory_bytes is the portion of that memory currently held in RAM (the resident set size). Both are standard process-level metrics provided by the Prometheus Python client and do not reflect GPU memory usage or allocation by vLLM itself. The memory they measure is consumed by the Python process running vLLM: code, data, loaded libraries, and any CPU-side data structures or buffers used by vLLM and its dependencies. The GPU model weights and KV cache reside on the GPU and are not tracked by these metrics.
For GPU memory usage, vLLM manages and reports this separately (e.g., via gpu_memory_utilization and related metrics), but these are not included in the process_virtual_memory_bytes or process_resident_memory_bytes values. For more details, see the explanations in vLLM metrics documentation and Prometheus client metrics.
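To see where these values come from, here is a minimal sketch (stdlib only, Linux-specific for the /proc part) that reads the same kernel counters the Prometheus Python client uses for these process metrics; the exact files the client reads are an implementation detail, but VmSize and VmRSS correspond to virtual and resident memory:

```python
import resource

# Peak resident set size of this process, as reported by the kernel.
# Note: ru_maxrss is in kilobytes on Linux but bytes on macOS.
rss_peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {rss_peak} (platform-dependent units)")

def read_proc_memory():
    """Read VmSize / VmRSS (in bytes) from /proc on Linux.

    These map to process_virtual_memory_bytes and
    process_resident_memory_bytes respectively.
    """
    mem = {}
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith(("VmSize:", "VmRSS:")):
                    key, value = line.split(":", 1)
                    mem[key] = int(value.split()[0]) * 1024  # kB -> bytes
    except FileNotFoundError:
        pass  # non-Linux platform; /proc is unavailable
    return mem

print(read_proc_memory())
```

Running this inside the vLLM server process would show the same order of magnitude as the Prometheus metrics; GPU allocations never appear here.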
Would you like more detail on how to monitor GPU memory usage in vLLM?
vLLM does not provide a per-component breakdown of GPU memory usage (model weights, KV cache, overhead) via Prometheus metrics; it exposes only aggregate GPU-related metrics. The main GPU-related metrics available are:
vllm:gpu_cache_usage_perc: Percentage of the pre-allocated GPU KV cache currently in use (1.0 = 100%). This shows how much of the reserved KV cache memory is actively used by running requests, not total GPU memory usage.
vllm:cache_config_info: Reports cache configuration, including gpu_memory_utilization (fraction of total GPU memory reserved for vLLM), number of GPU cache blocks, and related settings.
Other metrics (e.g., vllm:num_requests_running, vllm:num_requests_waiting, etc.) provide context on engine state but not direct GPU memory usage.
vLLM logs at startup provide a one-time breakdown of GPU memory usage for model weights, non-torch memory, PyTorch activation peak, and KV cache, but this is not exposed as a live metric per process. For more, see metrics documentation and example logs.
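A quick way to read these metrics is to scrape the server's /metrics endpoint and parse the Prometheus text format. The sketch below runs against a hard-coded sample payload; the metric names match vLLM's documented metrics, but the numeric values are made up for illustration:

```python
# Sample of what a vLLM /metrics scrape looks like (values are invented).
SAMPLE_SCRAPE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.42
vllm:num_requests_running 3.0
vllm:num_requests_waiting 1.0
"""

def parse_metrics(text):
    """Return {metric_name: value} for un-labeled gauge/counter samples."""
    values = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values

metrics = parse_metrics(SAMPLE_SCRAPE)
print(f"KV cache in use: {metrics['vllm:gpu_cache_usage_perc'] * 100:.0f}%")
# -> KV cache in use: 42%
```

In production you would fetch the text from the running server, e.g. `urllib.request.urlopen("http://localhost:8000/metrics")`, and a full parser would also need to handle labeled samples such as vllm:cache_config_info.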
Would you like a step-by-step guide to interpret these metrics or extract GPU memory usage from logs?
The metric vllm:gpu_cache_usage_perc shows the percentage of the pre-allocated GPU KV cache (as determined by --gpu-memory-utilization) that is currently in use by vLLM requests. The flag --gpu-memory-utilization=0.4 means vLLM will reserve 40% of the available GPU memory for its own use, and the KV cache size is calculated from this reserved memory after accounting for model weights and other overheads. Thus, vllm:gpu_cache_usage_perc reflects usage relative to this reserved portion, not the entire GPU.
When sharing a GPU, --gpu-memory-utilization=0.4 ensures vLLM only uses up to 40% of the GPU’s available memory at startup, allowing other processes to use the remaining memory. This is a per-instance limit and does not coordinate with other vLLM or non-vLLM processes on the same GPU; each instance enforces its own limit independently. For more, see CacheConfig docs and metrics documentation.
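As a rough sketch of how the KV cache budget falls out of --gpu-memory-utilization: all numbers below are illustrative assumptions (a 7B-class fp16 model on an 80 GiB GPU), not vLLM's actual accounting, which also profiles activation peaks at startup:

```python
GIB = 1024**3

total_gpu_mem  = 80 * GIB
gpu_mem_util   = 0.4                       # --gpu-memory-utilization=0.4
reserved       = total_gpu_mem * gpu_mem_util

model_weights  = 14 * GIB                  # ~7B params in fp16 (assumed)
other_overhead = 2 * GIB                   # CUDA context, activations, etc. (assumed)
kv_cache_budget = reserved - model_weights - other_overhead

# Per-token KV size: 2 (K and V) * layers * kv_heads * head_dim * dtype bytes.
# Shapes below are typical for a 7B model with grouped-query attention.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

block_size = 16                            # tokens per KV-cache block
num_gpu_blocks = int(kv_cache_budget // (kv_bytes_per_token * block_size))

print(f"KV cache budget: {kv_cache_budget / GIB:.1f} GiB")
print(f"~{num_gpu_blocks} blocks -> ~{num_gpu_blocks * block_size} cacheable tokens")
```

With these assumed numbers, 40% of 80 GiB is 32 GiB reserved, of which 16 GiB is left for the KV cache; vllm:gpu_cache_usage_perc is then measured against that 16 GiB, not against the full 80 GiB.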
Would you like a step-by-step example of how the KV cache size is calculated?
How do I determine / set / calculate an appropriate value for --gpu-memory-utilization?
Setting --gpu-memory-utilization=0.4 means that the up-to-40% of available memory is inclusive of model weights, KV cache, and other overheads? And vllm:gpu_cache_usage_perc shows usage of the memory pre-allocated from that 40% for the KV cache?
Yes, setting --gpu-memory-utilization=0.4 means vLLM will reserve up to 40% of the total GPU memory for its own use, and this 40% includes model weights, KV cache, and other overheads. After loading model weights and accounting for overheads, the remaining portion of that 40% is used to allocate the KV cache. The metric vllm:gpu_cache_usage_perc then shows the percentage of this pre-allocated KV cache that is currently in use by active requests—not the percentage of total GPU memory.
To determine an appropriate value for --gpu-memory-utilization, consider the size of your model weights, expected KV cache needs (which depend on batch size and sequence length), and leave headroom for activations and overhead. If you set the value too high, you risk OOM errors; too low, and you may limit throughput. The default is 0.9, but for multi-tenant or shared GPU scenarios, a lower value (e.g., 0.4) is safer. For more, see CacheConfig docs and discussion.
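The sizing reasoning above can be sketched as a small helper. Everything here is an assumption to plug your own figures into: the function name, the 2 GiB overhead default, and the 5% safety margin are all illustrative, not part of vLLM:

```python
GIB = 1024**3  # not used in the math below, shown for unit clarity

def suggest_gpu_memory_utilization(total_gpu_gib, weights_gib,
                                   kv_cache_gib, overhead_gib=2.0,
                                   safety_margin=0.05):
    """Convert a memory budget into a --gpu-memory-utilization fraction.

    Adds a safety margin for activations/fragmentation and caps the
    result below 1.0 to leave room for the CUDA context.
    """
    needed = weights_gib + kv_cache_gib + overhead_gib
    fraction = needed / total_gpu_gib + safety_margin
    return round(min(fraction, 0.95), 2)

# Example: 7B fp16 model (~14 GiB weights) on an 80 GiB GPU,
# wanting roughly 8 GiB of KV cache for the expected batch load.
util = suggest_gpu_memory_utilization(80, 14, 8)
print(f"--gpu-memory-utilization={util}")
# -> --gpu-memory-utilization=0.35
```

If other tenants share the GPU, verify the sum of all instances' fractions stays below 1.0, since each vLLM instance enforces its limit independently.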
Would you like a step-by-step example for a specific model and GPU?