The metrics process_virtual_memory_bytes and process_resident_memory_bytes report CPU memory usage, not GPU memory. process_virtual_memory_bytes is the total virtual address space allocated by the vLLM server process; process_resident_memory_bytes is the portion of that memory currently held in RAM (the resident set size). Both are standard process-level metrics provided by the Prometheus Python client and do not reflect GPU memory usage or allocation by vLLM itself. The memory they measure is consumed by the Python process running vLLM: code, data, loaded libraries, and any CPU-side data structures or buffers used by vLLM and its dependencies. The GPU model weights and KV cache reside on the GPU and are not tracked by these metrics.
For GPU memory usage, vLLM manages and reports this separately (e.g., via gpu_memory_utilization and related metrics), but these are not included in the process_virtual_memory_bytes or process_resident_memory_bytes values. For more details, see the explanations in vLLM metrics documentation and Prometheus client metrics.
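To see where these values come from, here is a minimal sketch (stdlib only, Linux-specific for the /proc part) that reads the same kernel counters the Prometheus Python client uses for these process metrics; the exact files the client reads are an implementation detail, but VmSize and VmRSS correspond to virtual and resident memory:

```python
import resource

# Peak resident set size of this process, as reported by the kernel.
# Note: ru_maxrss is in kilobytes on Linux but bytes on macOS.
rss_peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {rss_peak} (platform-dependent units)")

def read_proc_memory():
    """Read VmSize / VmRSS (in bytes) from /proc on Linux.

    These map to process_virtual_memory_bytes and
    process_resident_memory_bytes respectively.
    """
    mem = {}
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith(("VmSize:", "VmRSS:")):
                    key, value = line.split(":", 1)
                    mem[key] = int(value.split()[0]) * 1024  # kB -> bytes
    except FileNotFoundError:
        pass  # non-Linux platform; /proc is unavailable
    return mem

print(read_proc_memory())
```

Running this inside the vLLM server process would show the same order of magnitude as the Prometheus metrics; GPU allocations never appear here.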
Would you like more detail on how to monitor GPU memory usage in vLLM?
vLLM does not provide a per-component breakdown of GPU memory usage (model weights, KV cache, overhead) via Prometheus metrics; it exposes only aggregate GPU-related metrics. The main GPU-related metrics available are:
vllm:gpu_cache_usage_perc: Percentage of the pre-allocated GPU KV cache currently in use (1.0 = 100%). This shows how much of the reserved KV cache memory is actively used by running requests, not total GPU memory usage.
vllm:cache_config_info: Reports cache configuration, including gpu_memory_utilization (fraction of total GPU memory reserved for vLLM), number of GPU cache blocks, and related settings.
Other metrics (e.g., vllm:num_requests_running, vllm:num_requests_waiting, etc.) provide context on engine state but not direct GPU memory usage.
vLLM logs at startup provide a one-time breakdown of GPU memory usage for model weights, non-torch memory, PyTorch activation peak, and KV cache, but this is not exposed as a live metric per process. For more, see metrics documentation and example logs.
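A quick way to read these metrics is to scrape the server's /metrics endpoint and parse the Prometheus text format. The sketch below runs against a hard-coded sample payload; the metric names match vLLM's documented metrics, but the numeric values are made up for illustration:

```python
# Sample of what a vLLM /metrics scrape looks like (values are invented).
SAMPLE_SCRAPE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.42
vllm:num_requests_running 3.0
vllm:num_requests_waiting 1.0
"""

def parse_metrics(text):
    """Return {metric_name: value} for un-labeled gauge/counter samples."""
    values = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values

metrics = parse_metrics(SAMPLE_SCRAPE)
print(f"KV cache in use: {metrics['vllm:gpu_cache_usage_perc'] * 100:.0f}%")
# -> KV cache in use: 42%
```

In production you would fetch the text from the running server, e.g. `urllib.request.urlopen("http://localhost:8000/metrics")`, and a full parser would also need to handle labeled samples such as vllm:cache_config_info.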
Would you like a step-by-step guide to interpret these metrics or extract GPU memory usage from logs?
The metric vllm:gpu_cache_usage_perc shows the percentage of the pre-allocated GPU KV cache (as determined by --gpu-memory-utilization) that is currently in use by vLLM requests. The flag --gpu-memory-utilization=0.4 means vLLM will reserve 40% of the available GPU memory for its own use, and the KV cache size is calculated from this reserved memory after accounting for model weights and other overheads. Thus, vllm:gpu_cache_usage_perc reflects usage relative to this reserved portion, not the entire GPU.
When sharing a GPU, --gpu-memory-utilization=0.4 ensures vLLM only uses up to 40% of the GPU’s available memory at startup, allowing other processes to use the remaining memory. This is a per-instance limit and does not coordinate with other vLLM or non-vLLM processes on the same GPU; each instance enforces its own limit independently. For more, see CacheConfig docs and metrics documentation.
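As a rough sketch of how the KV cache budget falls out of --gpu-memory-utilization: all numbers below are illustrative assumptions (a 7B-class fp16 model on an 80 GiB GPU), not vLLM's actual accounting, which also profiles activation peaks at startup:

```python
GIB = 1024**3

total_gpu_mem  = 80 * GIB
gpu_mem_util   = 0.4                       # --gpu-memory-utilization=0.4
reserved       = total_gpu_mem * gpu_mem_util

model_weights  = 14 * GIB                  # ~7B params in fp16 (assumed)
other_overhead = 2 * GIB                   # CUDA context, activations, etc. (assumed)
kv_cache_budget = reserved - model_weights - other_overhead

# Per-token KV size: 2 (K and V) * layers * kv_heads * head_dim * dtype bytes.
# Shapes below are typical for a 7B model with grouped-query attention.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

block_size = 16                            # tokens per KV-cache block
num_gpu_blocks = int(kv_cache_budget // (kv_bytes_per_token * block_size))

print(f"KV cache budget: {kv_cache_budget / GIB:.1f} GiB")
print(f"~{num_gpu_blocks} blocks -> ~{num_gpu_blocks * block_size} cacheable tokens")
```

With these assumed numbers, 40% of 80 GiB is 32 GiB reserved, of which 16 GiB is left for the KV cache; vllm:gpu_cache_usage_perc is then measured against that 16 GiB, not against the full 80 GiB.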
Would you like a step-by-step example of how the KV cache size is calculated?
How do I determine / set / calculate an appropriate value for --gpu-memory-utilization?
Setting --gpu-memory-utilization=0.4 means that the up-to-40% of available memory is inclusive of model weights, KV cache, and other overheads? And vllm:gpu_cache_usage_perc shows usage of the memory pre-allocated from that 40% for the KV cache?
Yes, setting --gpu-memory-utilization=0.4 means vLLM will reserve up to 40% of the total GPU memory for its own use, and this 40% includes model weights, KV cache, and other overheads. After loading model weights and accounting for overheads, the remaining portion of that 40% is used to allocate the KV cache. The metric vllm:gpu_cache_usage_perc then shows the percentage of this pre-allocated KV cache that is currently in use by active requests—not the percentage of total GPU memory.
To determine an appropriate value for --gpu-memory-utilization, consider the size of your model weights, expected KV cache needs (which depend on batch size and sequence length), and leave headroom for activations and overhead. If you set the value too high, you risk OOM errors; too low, and you may limit throughput. The default is 0.9, but for multi-tenant or shared GPU scenarios, a lower value (e.g., 0.4) is safer. For more, see CacheConfig docs and discussion.
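The sizing reasoning above can be sketched as a small helper. Everything here is an assumption to plug your own figures into: the function name, the 2 GiB overhead default, and the 5% safety margin are all illustrative, not part of vLLM:

```python
GIB = 1024**3  # not used in the math below, shown for unit clarity

def suggest_gpu_memory_utilization(total_gpu_gib, weights_gib,
                                   kv_cache_gib, overhead_gib=2.0,
                                   safety_margin=0.05):
    """Convert a memory budget into a --gpu-memory-utilization fraction.

    Adds a safety margin for activations/fragmentation and caps the
    result below 1.0 to leave room for the CUDA context.
    """
    needed = weights_gib + kv_cache_gib + overhead_gib
    fraction = needed / total_gpu_gib + safety_margin
    return round(min(fraction, 0.95), 2)

# Example: 7B fp16 model (~14 GiB weights) on an 80 GiB GPU,
# wanting roughly 8 GiB of KV cache for the expected batch load.
util = suggest_gpu_memory_utilization(80, 14, 8)
print(f"--gpu-memory-utilization={util}")
# -> --gpu-memory-utilization=0.35
```

If other tenants share the GPU, verify the sum of all instances' fractions stays below 1.0, since each vLLM instance enforces its limit independently.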
Would you like a step-by-step example for a specific model and GPU?