How to log KV cache usage and prefix hit rate in offline inference?
I run offline inference with Qwen1.5-14B on 1 GPU, and I want to log KV cache usage and prefix cache hit rate the way the benchmark does. How can I do this? Thanks.
To log KV cache usage and prefix cache hit rate during offline inference, you need to use vLLM’s metrics logging system. The LoggingStatLogger outputs these metrics every 5 seconds, including GPU KV cache usage and prefix cache hit rate, but it is enabled by default only in online serving mode. For offline inference, these metrics are not logged automatically unless you explicitly use the LoggingStatLogger in your code or modify the engine to enable it for offline runs. There is no command-line flag or public API in the data sources that enables this for offline inference out of the box; it is only described for serving mode and in the logger classes in the codebase (metrics.md, loggers.py).
If you want to log these metrics in offline inference, you would need to modify your inference script to instantiate and call LoggingStatLogger, or adapt the code to periodically print stats.gpu_cache_usage_sys and stats.gpu_prefix_cache_hit_rate from the Stats object used in the engine (metrics_types.py). Would you like a code example for how to do this?
I appreciate your answer. It would be even better if you could provide an example. :>
It seems likely that you can log KV cache usage and prefix cache hit rate in offline inference by manually accessing the relevant stats from the vLLM engine. There is no official API or command-line flag for this in offline mode, but you can extract these metrics by calling the internal stats methods after running inference. Here’s a minimal example based on the available data sources:
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1)
prompts = ["Your prompt here"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Run inference
outputs = llm.generate(prompts, sampling_params)
# Access internal stats (private API; may change across versions).
# In recent V0 versions _get_stats() requires a scheduler_outputs argument;
# passing None returns only the system-level stats used below.
stats = llm.llm_engine._get_stats(scheduler_outputs=None)
print(f"GPU KV cache usage: {stats.gpu_cache_usage_sys * 100:.2f}%")
print(f"Prefix cache hit rate: {getattr(stats, 'gpu_prefix_cache_hit_rate', 0) * 100:.2f}%")
This approach is not officially documented for offline inference, but it reflects the internal usage in vLLM’s codebase (metrics_types.py). Would you like a more robust or production-ready example?
I have another question. For a model like Qwen1.5-14B, would using CUDA Graph (instead of setting enforce_eager=True) result in better performance? In my test there was no difference between them; is that correct?
For most models, using CUDA Graph (i.e., enforce_eager=False, the default) should result in better performance than enforce_eager=True, especially for small to medium batch sizes, because CUDA Graph reduces CPU overhead (engine_args docs, github_issue_4449). However, the actual speedup can be small or negligible for large models or large batch sizes, or if the model is already compute/memory bound (github_pull_request_1926, github_pull_request_1440). If you see no difference in your tests, it may be due to your specific hardware, batch size, or the model’s characteristics.
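If you want to measure this yourself, here is a rough timing sketch (not a rigorous benchmark). The EAGER environment variable and the small batch of repeated prompts are illustrative choices, not anything vLLM defines; run the script twice, once with EAGER=1 and once without, so only one engine occupies GPU memory at a time:
import os
import time
from vllm import LLM, SamplingParams

enforce_eager = os.environ.get("EAGER", "0") == "1"
llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1, enforce_eager=enforce_eager)

prompts = ["Hello, my name is"] * 8            # small batch, where CUDA Graph helps most
params = SamplingParams(temperature=0.0, max_tokens=64)

llm.generate(prompts, params)                  # warm-up (includes graph capture when not eager)
start = time.perf_counter()
llm.generate(prompts, params)
print(f"enforce_eager={enforce_eager}: {time.perf_counter() - start:.2f} s")
Differences tend to show up most with short outputs and small batches, where per-step CPU overhead dominates.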
Would you like a step-by-step explanation of when CUDA Graph provides a noticeable speedup?
And how do I log the running and waiting lists during inference?
In offline inference, vLLM does not automatically log the number of running and waiting requests as it does in serving mode. However, you can access these stats by calling the internal _get_stats() method on the engine and printing stats.num_running_sys and stats.num_waiting_sys (metrics_types.py, metrics.md).
Example:
stats = llm.llm_engine._get_stats(scheduler_outputs=None)  # private API; see note in the earlier example
print(f"Running: {stats.num_running_sys} reqs, Waiting: {stats.num_waiting_sys} reqs")
Would you like a full example script for this?
stats = self.llm.llm_engine._get_stats()
AttributeError: 'LLMEngine' object has no attribute '_get_stats'
I get this error. Thanks for your answer :》
The method _get_stats() is an internal (private) method and may not be accessible or stable across vLLM versions. In recent versions it is not part of the public API and may not be available on the LLMEngine object (metrics_types.py, llm_engine.py). There is currently no officially supported public API for retrieving these stats in offline inference.
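As a quick sanity check, you can inspect which engine class your LLM object wraps and whether the private method exists in your installed version (a small sketch; build the engine the same way your script does):
from vllm import LLM

llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1)
print(type(llm.llm_engine))                    # vllm.engine.llm_engine.LLMEngine (V0)
                                               # or vllm.v1.engine.llm_engine.LLMEngine (V1)
print(hasattr(llm.llm_engine, "_get_stats"))   # False when the V1 engine is active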
If you need these metrics, you would have to modify the vLLM source code to expose them, or use a custom logger as described in the codebase. Would you like guidance on how to patch vLLM to expose these stats?
What is the difference between vllm/engine/llm_engine.py and vllm/v1/engine/llm_engine.py? Is vllm/engine/llm_engine.py used in offline inference? Thank you for your answer.
vllm/engine/llm_engine.py is the legacy (V0) engine, while vllm/v1/engine/llm_engine.py is the new V1 engine. V1 is now the default for all supported use cases, including offline inference, and the V0 engine is kept mainly for backward compatibility (vLLM V1 User Guide, github_issue_17899).
So, for offline inference, vllm/v1/engine/llm_engine.py is used by default. The V0 engine (vllm/engine/llm_engine.py) is only used if you explicitly set VLLM_USE_V1=0 (github_issue_17899). Would you like more detail on the differences between V0 and V1?
My vLLM version is v0.8.1. In this version only vllm/engine/llm_engine.py has _get_stats(); vllm/v1/engine/llm_engine.py does not. So if I use the V1 engine in offline inference, how do I log KV cache usage, prefix hit rate, running list, and waiting list?
In vLLM v0.8.1, only the legacy engine (vllm/engine/llm_engine.py) has the _get_stats() method. The V1 engine (vllm/v1/engine/llm_engine.py) does not provide _get_stats() or a public API for logging KV cache usage, prefix hit rate, or the running and waiting lists in offline inference (github_issue_15775). There is no supported way to log these metrics in offline inference with the V1 engine in v0.8.1.
If you need these metrics, you must use the V0 engine by setting VLLM_USE_V1=0 (github_issue_15775). Would you like instructions for enabling V0 mode?
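In the meantime, here is a minimal sketch of forcing the V0 engine. The environment variable must be set before vLLM creates the engine; exporting it in the shell before launching the script is the most reliable way, but setting it at the top of the script as shown here generally works too:
import os
os.environ["VLLM_USE_V1"] = "0"   # must be set before vLLM initializes the engine

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1)
outputs = llm.generate(["Your prompt here"], SamplingParams(max_tokens=32))
stats = llm.llm_engine._get_stats(scheduler_outputs=None)   # available again on the V0 engine
print(f"GPU KV cache usage: {stats.gpu_cache_usage_sys * 100:.2f}%")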
Are there any methods available for the v1 engine to log out the kvcache usage, prefix hit rate, running list, and waiting list?
There is currently no public API or method in vLLM v1 engine to directly log KV cache usage, prefix hit rate, running list, or waiting list during offline inference. These metrics are available in serving mode via Prometheus or logging, but not exposed for offline inference or through a user-facing API in v1 (github_issue_15775, metrics.md).
If you need these metrics in offline inference, you would have to modify the vLLM source code to expose them, as there is no built-in support for this in v1. Would you like guidance on how to patch the code for this purpose?
Yes, I need guidance. How should I modify the code to log KV cache usage, prefix hit rate, running list, and waiting list during offline inference?
To log KV cache usage, prefix hit rate, running list, and waiting list during offline inference in vLLM v1, you need to patch the vLLM source code because these metrics are not exposed by default (metrics.md, github_issue_15775). The general approach is to insert logging statements in the v1 engine code where these stats are available internally.
Step-by-step guidance:
- Locate the Stats Object: In vLLM v1, the Stats dataclass (see metrics_types.py) holds fields like gpu_cache_usage_sys, gpu_prefix_cache_hit_rate, num_running_sys, and num_waiting_sys.
- Find Where Stats Are Updated: Stats are typically updated in the engine or scheduler. For example, in vllm/v1/engine/core.py or the scheduler implementation, look for where a Stats object is created or updated.
- Insert Logging in Your Inference Loop: After each inference step (or at your desired interval), add code to print or log the relevant fields from the Stats object. For example:
  # After running inference or at the end of each step
  stats = ...  # however you access the Stats object in your context
  print(f"GPU KV cache usage: {stats.gpu_cache_usage_sys * 100:.2f}%")
  print(f"Prefix cache hit rate: {stats.gpu_prefix_cache_hit_rate * 100:.2f}%")
  print(f"Running requests: {stats.num_running_sys}")
  print(f"Waiting requests: {stats.num_waiting_sys}")
- Expose Stats to Your Script: If the Stats object is not accessible from your inference script, you may need to add a method to the engine class (e.g., get_stats()) that returns the latest Stats object, similar to how _get_stats() worked in V0 (a usage sketch follows after these steps).
- Rebuild and Use Your Patched vLLM: After making these changes, rebuild vLLM and run your offline inference script. You should now see the desired metrics printed or logged.
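For illustration, here is a rough sketch of what using such a patch could look like from your inference script. The get_stats() accessor is hypothetical (it is the method you would add in the "Expose Stats to Your Script" step); the field names mirror metrics_types.py:
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1)
outputs = llm.generate(["Your prompt here"], SamplingParams(max_tokens=64))

stats = llm.llm_engine.get_stats()   # hypothetical accessor added by your patch
print(f"GPU KV cache usage: {stats.gpu_cache_usage_sys * 100:.2f}%")
print(f"Prefix cache hit rate: {stats.gpu_prefix_cache_hit_rate * 100:.2f}%")
print(f"Running: {stats.num_running_sys} reqs, Waiting: {stats.num_waiting_sys} reqs")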
Would you like a concrete code example for a specific vLLM version or more detail on where to insert these changes?
How do I use vLLM for online inference? Can you give me a code example?
To use vLLM for online inference, start the OpenAI-compatible API server with your model using the vllm serve command, then send HTTP requests to it. Here’s a minimal example (Quickstart):
- Start the server:
vllm serve facebook/opt-125m
- Send a request (Python example):
import requests
response = requests.post(
"http://localhost:8000/v1/completions",
json={
"model": "facebook/opt-125m",
"prompt": "Hello, my name is",
"max_tokens": 16,
"temperature": 0.8,
},
)
print(response.json())
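Since the server is OpenAI-compatible, you can also use the official openai Python client instead of raw requests (a sketch; the api_key value is a placeholder because the local server does not require authentication by default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="facebook/opt-125m",
    prompt="Hello, my name is",
    max_tokens=16,
    temperature=0.8,
)
print(completion.choices[0].text)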
Would you like an example for the chat API or more advanced usage?