How to log KV cache usage and prefix hit rate in offline inference?
I run offline inference with Qwen1.5-14B on 1 GPU, and I want to log KV cache usage and prefix cache hit rate the way the benchmark does. How can I do this? Thanks.
To log KV cache usage and prefix cache hit rate during offline inference, you need to use vLLM’s metrics logging system. The LoggingStatLogger outputs these metrics every 5 seconds, including GPU KV cache usage and prefix cache hit rate, but it is enabled by default only in online serving mode. For offline inference, these metrics are not logged automatically unless you explicitly use the LoggingStatLogger in your code or modify the engine to enable it for offline runs. There is no command-line flag or public API in the data sources that enables this for offline inference out of the box; it is only described for serving mode and in the logger classes in the codebase (metrics.md, loggers.py).
If you want to log these metrics in offline inference, you would need to modify your inference script to instantiate and call LoggingStatLogger, or adapt the code to periodically print stats.gpu_cache_usage_sys and stats.gpu_prefix_cache_hit_rate from the Stats object used in the engine (metrics_types.py). Would you like a code example for how to do this?
I appreciate your answer. It would be even better if you could provide an example. :>
It seems likely that you can log KV cache usage and prefix cache hit rate in offline inference by manually accessing the relevant stats from the vLLM engine. There is no official API or command-line flag for this in offline mode, but you can extract these metrics by calling the internal stats methods after running inference. Here’s a minimal example based on the available data sources:
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1)
prompts = ["Your prompt here"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Run inference
outputs = llm.generate(prompts, sampling_params)
# Access internal stats (private API; may change across versions).
# In recent V0 versions _get_stats() requires a scheduler_outputs argument;
# passing None returns only the system-level stats used below.
stats = llm.llm_engine._get_stats(scheduler_outputs=None)
print(f"GPU KV cache usage: {stats.gpu_cache_usage_sys * 100:.2f}%")
print(f"Prefix cache hit rate: {getattr(stats, 'gpu_prefix_cache_hit_rate', 0) * 100:.2f}%")
This approach is not officially documented for offline inference, but it reflects the internal usage in vLLM’s codebase (metrics_types.py). Would you like a more robust or production-ready example?
I have another question. For a model like Qwen1.5-14B, would using CUDA Graph (instead of setting enforce_eager=True) result in better performance? In my test there was no difference between them; is that correct?
For most models, using CUDA Graph (i.e., enforce_eager=False, the default) should result in better performance than enforce_eager=True, especially for small to medium batch sizes, because CUDA Graph reduces CPU overhead (engine_args docs, github_issue_4449). However, the actual speedup can be small or negligible for large models or large batch sizes, or if the model is already compute/memory bound (github_pull_request_1926, github_pull_request_1440). If you see no difference in your tests, it may be due to your specific hardware, batch size, or the model’s characteristics.
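If you want to measure this yourself, here is a rough timing sketch (not a rigorous benchmark). The EAGER environment variable and the small batch of repeated prompts are illustrative choices, not anything vLLM defines; run the script twice, once with EAGER=1 and once without, so only one engine occupies GPU memory at a time:
import os
import time
from vllm import LLM, SamplingParams

enforce_eager = os.environ.get("EAGER", "0") == "1"
llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1, enforce_eager=enforce_eager)

prompts = ["Hello, my name is"] * 8            # small batch, where CUDA Graph helps most
params = SamplingParams(temperature=0.0, max_tokens=64)

llm.generate(prompts, params)                  # warm-up (includes graph capture when not eager)
start = time.perf_counter()
llm.generate(prompts, params)
print(f"enforce_eager={enforce_eager}: {time.perf_counter() - start:.2f} s")
Differences tend to show up most with short outputs and small batches, where per-step CPU overhead dominates.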
Would you like a step-by-step explanation of when CUDA Graph provides a noticeable speedup?
And how do I log the running and waiting lists during inference?
In offline inference, vLLM does not automatically log the number of running and waiting requests as it does in serving mode. However, you can access these stats by calling the internal _get_stats() method on the engine and printing stats.num_running_sys and stats.num_waiting_sys (metrics_types.py, metrics.md).
Example:
stats = llm.llm_engine._get_stats(scheduler_outputs=None)  # private API; see note in the earlier example
print(f"Running: {stats.num_running_sys} reqs, Waiting: {stats.num_waiting_sys} reqs")
Would you like a full example script for this?
stats = self.llm.llm_engine._get_stats()
AttributeError: 'LLMEngine' object has no attribute '_get_stats'
I get this error. Thanks for your answer :》
The method _get_stats() is an internal (private) method and may not be accessible or stable across vLLM versions. In recent versions it is not part of the public API and may not be available on the LLMEngine object (metrics_types.py, llm_engine.py). There is currently no officially supported public API for retrieving these stats in offline inference.
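As a quick sanity check, you can inspect which engine class your LLM object wraps and whether the private method exists in your installed version (a small sketch; build the engine the same way your script does):
from vllm import LLM

llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1)
print(type(llm.llm_engine))                    # vllm.engine.llm_engine.LLMEngine (V0)
                                               # or vllm.v1.engine.llm_engine.LLMEngine (V1)
print(hasattr(llm.llm_engine, "_get_stats"))   # False when the V1 engine is active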
If you need these metrics, you would have to modify the vLLM source code to expose them, or use a custom logger as described in the codebase. Would you like guidance on how to patch vLLM to expose these stats?
What is the difference between vllm/engine/llm_engine.py and vllm/v1/engine/llm_engine.py? Is vllm/engine/llm_engine.py used in offline inference? Thank you for your answer.
vllm/engine/llm_engine.py is the legacy (V0) engine, while vllm/v1/engine/llm_engine.py is the new V1 engine. V1 is now the default for all supported use cases, including offline inference, and the V0 engine is kept mainly for backward compatibility (vLLM V1 User Guide, github_issue_17899).
So, for offline inference, vllm/v1/engine/llm_engine.py is used by default. The V0 engine (vllm/engine/llm_engine.py) is only used if you explicitly set VLLM_USE_V1=0 (github_issue_17899). Would you like more detail on the differences between V0 and V1?
My vLLM version is v0.8.1. In this version only vllm/engine/llm_engine.py has _get_stats(); vllm/v1/engine/llm_engine.py does not. So if I use the V1 engine in offline inference, how do I log KV cache usage, prefix hit rate, running list, and waiting list?
In vLLM v0.8.1, only the legacy engine (vllm/engine/llm_engine.py) has the _get_stats() method. The V1 engine (vllm/v1/engine/llm_engine.py) does not provide _get_stats() or a public API for logging KV cache usage, prefix hit rate, or the running and waiting lists in offline inference (github_issue_15775). There is no supported way to log these metrics in offline inference with the V1 engine in v0.8.1.
If you need these metrics, you must use the V0 engine by setting VLLM_USE_V1=0 (github_issue_15775). Would you like instructions for enabling V0 mode?
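In the meantime, here is a minimal sketch of forcing the V0 engine. The environment variable must be set before vLLM creates the engine; exporting it in the shell before launching the script is the most reliable way, but setting it at the top of the script as shown here generally works too:
import os
os.environ["VLLM_USE_V1"] = "0"   # must be set before vLLM initializes the engine

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1)
outputs = llm.generate(["Your prompt here"], SamplingParams(max_tokens=32))
stats = llm.llm_engine._get_stats(scheduler_outputs=None)   # available again on the V0 engine
print(f"GPU KV cache usage: {stats.gpu_cache_usage_sys * 100:.2f}%")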
Are there any methods available for the v1 engine to log out the kvcache usage, prefix hit rate, running list, and waiting list?
There is currently no public API or method in vLLM v1 engine to directly log KV cache usage, prefix hit rate, running list, or waiting list during offline inference. These metrics are available in serving mode via Prometheus or logging, but not exposed for offline inference or through a user-facing API in v1 (github_issue_15775, metrics.md).
If you need these metrics in offline inference, you would have to modify the vLLM source code to expose them, as there is no built-in support for this in v1. Would you like guidance on how to patch the code for this purpose?
Yes, I need guidance. How should I modify the code to log KV cache usage, prefix hit rate, running list, and waiting list during offline inference?
To log KV cache usage, prefix hit rate, running list, and waiting list during offline inference in vLLM v1, you need to patch the vLLM source code because these metrics are not exposed by default (metrics.md, github_issue_15775). The general approach is to insert logging statements in the v1 engine code where these stats are available internally.
Step-by-step guidance:
- Locate the Stats Object: In vLLM v1, the Stats dataclass (see metrics_types.py) holds fields like gpu_cache_usage_sys, gpu_prefix_cache_hit_rate, num_running_sys, and num_waiting_sys.
- Find Where Stats Are Updated: Stats are typically updated in the engine or scheduler. For example, in vllm/v1/engine/core.py or the scheduler implementation, look for where a Stats object is created or updated.
- Insert Logging in Your Inference Loop: After each inference step (or at your desired interval), add code to print or log the relevant fields from the Stats object. For example:
  # After running inference or at the end of each step
  stats = ...  # however you access the Stats object in your context
  print(f"GPU KV cache usage: {stats.gpu_cache_usage_sys * 100:.2f}%")
  print(f"Prefix cache hit rate: {stats.gpu_prefix_cache_hit_rate * 100:.2f}%")
  print(f"Running requests: {stats.num_running_sys}")
  print(f"Waiting requests: {stats.num_waiting_sys}")
- Expose Stats to Your Script: If the Stats object is not accessible from your inference script, you may need to add a method to the engine class (e.g., get_stats()) that returns the latest Stats object, similar to how _get_stats() worked in V0 (a usage sketch follows after these steps).
- Rebuild and Use Your Patched vLLM: After making these changes, rebuild vLLM and run your offline inference script. You should now see the desired metrics printed or logged.
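For illustration, here is a rough sketch of what using such a patch could look like from your inference script. The get_stats() accessor is hypothetical (it is the method you would add in the "Expose Stats to Your Script" step); the field names mirror metrics_types.py:
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-14B", tensor_parallel_size=1)
outputs = llm.generate(["Your prompt here"], SamplingParams(max_tokens=64))

stats = llm.llm_engine.get_stats()   # hypothetical accessor added by your patch
print(f"GPU KV cache usage: {stats.gpu_cache_usage_sys * 100:.2f}%")
print(f"Prefix cache hit rate: {stats.gpu_prefix_cache_hit_rate * 100:.2f}%")
print(f"Running: {stats.num_running_sys} reqs, Waiting: {stats.num_waiting_sys} reqs")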
Would you like a concrete code example for a specific vLLM version or more detail on where to insert these changes?
How do I use vLLM for online inference? Can you give me a code example?
To use vLLM for online inference, start the OpenAI-compatible API server with your model using the vllm serve command, then send HTTP requests to it. Here’s a minimal example (Quickstart):
- Start the server:
vllm serve facebook/opt-125m
- Send a request (Python example):
import requests
response = requests.post(
"http://localhost:8000/v1/completions",
json={
"model": "facebook/opt-125m",
"prompt": "Hello, my name is",
"max_tokens": 16,
"temperature": 0.8,
},
)
print(response.json())
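Since the server is OpenAI-compatible, you can also use the official openai Python client instead of raw requests (a sketch; the api_key value is a placeholder because the local server does not require authentication by default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="facebook/opt-125m",
    prompt="Hello, my name is",
    max_tokens=16,
    temperature=0.8,
)
print(completion.choices[0].text)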
Would you like an example for the chat API or more advanced usage?