How to get the log for benchmarking

When I used the official vllm/benchmarks/throughput.py, the system automatically printed a log from loggers.py showing detailed data such as throughput, GPU KV cache usage, and prefix cache hit rate. I have also seen threads sharing logs with TTFT, TPOT, etc.

However, when I customized the benchmark file to collect more data, the log suddenly disappeared. I'm wondering how I can get the log back, and how to get even more data like TTFT. @RunLLM

To restore the detailed logs (throughput, GPU KV cache usage, prefix cache hit rate, etc.) from loggers.py when running or customizing benchmarks/throughput.py, ensure that logging statistics are enabled by not setting --disable-log-stats or by explicitly setting it to False. This flag controls whether vLLM prints periodic performance logs to the console. For even more frequent or verbose logs, set the environment variable VLLM_LOGGING_LEVEL=DEBUG and optionally adjust VLLM_LOG_STATS_INTERVAL for log frequency, as described in the troubleshooting guide.
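
Concretely, that combination looks like this in a driver script (a sketch: the environment variable names are the ones mentioned above, and the engine construction is shown commented out so the snippet stands alone without a GPU):

```python
import os

# Set these before the vLLM engine starts.
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"    # verbose engine logging
os.environ["VLLM_LOG_STATS_INTERVAL"] = "5"   # stats log period in seconds

# Then construct the engine with stats logging left on, e.g. (not run here):
#   from vllm import LLM
#   llm = LLM(model="facebook/opt-125m", disable_log_stats=False)
print(os.environ["VLLM_LOGGING_LEVEL"])  # → DEBUG
```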

To get additional metrics like TTFT (Time To First Token) and TPOT (Time Per Output Token), use the --save-result flag with the benchmark scripts. This will save a JSON file containing per-request statistics, including TTFT and ITL (Inter-Token Latency). If you want these metrics in your custom script, ensure you are collecting and printing them from the benchmark results, as shown in the official scripts. For reference, see the discussion on latency metrics and how to save results.

Would you like a code example or more detail on customizing the logging or metrics output?

Sources:

Hint: Mention @RunLLM in the post for followups.

I think I didn't use the --disable-log-stats flag, and there isn't a --save-result flag, as it reported unrecognized arguments.

Maybe you can just provide a detailed code example on how to get and print these results (TTFT, GPU KV cache usage, prefix cache hit rate) in a custom script, as shown in the official scripts. @RunLLM

Certainly! To print TTFT, GPU KV cache usage, and prefix cache hit rate in a custom script, you need to (1) collect per-request metrics from the outputs, and (2) access engine-level metrics via the LLM object. The official offline_inference/metrics.py example demonstrates how to print engine metrics, while TTFT is available in the outputs returned by llm.generate().

Here’s a concise example based on the official scripts:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m", disable_log_stats=False)

outputs = llm.generate(prompts, sampling_params)

# Print TTFT for each output (if available)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    ttft = getattr(output, "ttft", None)
    print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}\nTTFT: {ttft}")
    print("-" * 50)

# Print engine-level metrics (GPU KV cache usage, prefix cache hit rate, etc.)
for metric in llm.get_metrics():
    if metric.name == "gpu_kv_cache_usage":
        print(f"GPU KV cache usage: {metric.value}")
    elif metric.name == "prefix_cache_hit_rate":
        print(f"Prefix cache hit rate: {metric.value}")
    else:
        print(f"{metric.name}: {getattr(metric, 'value', getattr(metric, 'values', None))}")

This approach is directly shown in offline_inference/metrics.py. TTFT and other per-request metrics are available in the output objects, while engine metrics are accessed via llm.get_metrics().

Would you like a breakdown of how to extract more custom metrics or further explanation?

However, your code gave TTFT: None and GPU KV cache usage: 0, which makes no sense. @RunLLM

You’re correct: in recent vLLM versions, the per-request .ttft and .metrics fields are often None, and llm.get_metrics() may show gpu_kv_cache_usage as 0 if no requests are running or if logging is misconfigured. This is a known issue in vLLM V1, as discussed in issue #25336 and issue #16348: per-request metrics like TTFT are not always populated, and engine metrics may not update unless requests are actively processed and logging is enabled.

To reliably get these metrics:

  • Ensure you are using the latest vLLM main branch, as some fixes are not in released versions.
  • For TTFT and per-request metrics, use the benchmark scripts (benchmarks/serve.py or benchmark_serving.py) which aggregate and print these values after the run, rather than relying on .metrics in the output.
  • For engine metrics, make sure disable_log_stats=False and that you are running active requests; otherwise, metrics like GPU KV cache usage will remain at 0.

Would you like a step-by-step guide to extracting TTFT and cache metrics using the benchmark scripts?

Yes, please give me a step-by-step guide to extracting TTFT and GPU KV cache usage.
Maybe also avg scheduling delay and batch utilization, please. @RunLLM

It seems likely that the most reliable way to extract TTFT (Time To First Token), GPU KV cache usage, and related metrics is to use vLLM’s official benchmark scripts, which are designed to collect and print these statistics. Here’s a step-by-step guide:

1. Use the Official Benchmark Script

Run the benchmark script (e.g., benchmarks/benchmark_serving.py or vllm bench serve) with your desired model and parameters. For example:

vllm bench serve \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 128 \
  --max-concurrency 8 \
  --num-prompts 100

This will print a summary at the end, including TTFT, TPOT, ITL, and throughput. See benchmark example and Llama4-Scout recipe.

2. Enable and View Engine Metrics

While the benchmark runs, vLLM logs engine metrics (GPU KV cache usage, prefix cache hit rate, etc.) every few seconds if disable_log_stats is not set. These appear in the console as:

INFO [metrics.py] Avg prompt throughput: ... tokens/s, Avg generation throughput: ... tokens/s, Running: ... reqs, GPU KV cache usage: ...%, Prefix cache hit rate: ...%

If you want to collect these programmatically, you can scrape the /metrics Prometheus endpoint (see metrics design) or parse the log output.
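
A minimal sketch of the scraping option, parsing the Prometheus text format by hand (the metric name and label here are illustrative samples; check your server's actual /metrics output, since names vary across vLLM versions):

```python
import re

def parse_prometheus_text(text, metric_name):
    """Return the first sample value found for metric_name in
    Prometheus text exposition format, or None if absent."""
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = re.match(rf"{re.escape(metric_name)}(?:\{{[^}}]*\}})?\s+(\S+)", line)
        if match:
            return float(match.group(1))
    return None

# In a real run you would fetch the payload with urllib/requests from
# http://localhost:8000/metrics; here we parse a sample shaped like
# vLLM's exposition output:
sample = """\
# HELP vllm:kv_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{model_name="facebook/opt-125m"} 0.27
"""
print(parse_prometheus_text(sample, "vllm:kv_cache_usage_perc"))  # → 0.27
```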

3. Extract Batch Utilization and Scheduling Delay

  • Batch Utilization: This is not directly printed, but you can infer it from the number of running requests and the max_num_seqs parameter in the log lines.
  • Avg Scheduling Delay: Not directly exposed, but queue times per request are available in the Prometheus metrics as vllm:request_queue_time_seconds (see metrics loggers.py).
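
If you parse the log output instead, a regex sketch like the following works; the log line here is a fabricated sample in the format shown above, and batch utilization is inferred by dividing running requests by an assumed max_num_seqs, not read from the log:

```python
import re

LOG_LINE = ("Avg prompt throughput: 120.5 tokens/s, "
            "Avg generation throughput: 950.2 tokens/s, "
            "Running: 6 reqs, GPU KV cache usage: 41.3%, "
            "Prefix cache hit rate: 12.5%")

def parse_stats_line(line, max_num_seqs=8):
    """Pull running-request count and cache stats out of one vLLM
    stats log line; batch utilization is inferred, not logged directly."""
    running = int(re.search(r"Running:\s+(\d+)\s+reqs", line).group(1))
    kv_usage = float(re.search(r"GPU KV cache usage:\s+([\d.]+)%", line).group(1))
    hit_rate = float(re.search(r"Prefix cache hit rate:\s+([\d.]+)%", line).group(1))
    return {
        "running_reqs": running,
        "gpu_kv_cache_usage_pct": kv_usage,
        "prefix_cache_hit_rate_pct": hit_rate,
        "batch_utilization": running / max_num_seqs,
    }

print(parse_stats_line(LOG_LINE))
```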

4. Save Detailed Results

For per-request TTFT and latency, use the --save-result flag (if available) with the benchmark script to save a JSON file containing detailed metrics for each request. You can then parse this file for TTFT and queue times.
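
As a hedged sketch of parsing such a file (the key names ttfts/itls and the units are assumptions for illustration; inspect your actual JSON, since the schema differs across vLLM versions):

```python
import json
import statistics
import tempfile

# Hypothetical result file mimicking the shape of a --save-result JSON:
# per-request TTFTs in seconds, and per-request lists of inter-token latencies.
sample_result = {
    "ttfts": [0.105, 0.093, 0.250, 0.088],
    "itls": [[0.030, 0.031], [0.029, 0.030]],
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample_result, f)
    path = f.name

with open(path) as f:
    result = json.load(f)

ttfts_ms = [t * 1000 for t in result["ttfts"]]                    # s -> ms
itl_ms = [x * 1000 for req in result["itls"] for x in req]        # flatten
print(f"Mean TTFT (ms): {statistics.mean(ttfts_ms):.2f}")
print(f"Median TTFT (ms): {statistics.median(ttfts_ms):.2f}")
print(f"Mean ITL (ms): {statistics.mean(itl_ms):.2f}")
```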

Summary Table Example (from script output):

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  19.37
Request throughput (req/s):              51.63
Output token throughput (tok/s):         6628.49
---------------Time to First Token----------------
Mean TTFT (ms):                          3819.32
Median TTFT (ms):                        3076.68
P99 TTFT (ms):                           12275.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           182.94

Would you like a code snippet for parsing the saved JSON results or for scraping the Prometheus metrics endpoint?

So can I get TTFT without using the official benchmark script? @RunLLM

You cannot reliably get TTFT (Time To First Token) for each request using the standard llm.generate() API in vLLM V1, because the per-request .ttft or .metrics fields are often None or not populated, as confirmed in issue #25336 and issue #16348. The official benchmark scripts implement custom timing logic to measure TTFT externally.

If you want TTFT outside the benchmark script, you must manually record the time just before sending the request and the time when the first token is received (for streaming APIs), or measure the elapsed time for the first output in a batch. This is not as accurate as the internal TTFT metric, and is not natively supported in the vLLM Python API as of now.
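
That manual timing pattern looks like the sketch below; a stand-in generator replaces the real streaming call so the snippet runs without a model, and the delays are purely simulated:

```python
import time

def fake_token_stream(n_tokens=5, first_delay=0.05, step_delay=0.01):
    """Stand-in for a streaming generate() call; swap in your
    engine's streaming iterator in real use."""
    time.sleep(first_delay)     # simulated prefill before the first token
    yield "first"
    for i in range(n_tokens - 1):
        time.sleep(step_delay)  # simulated decode step per token
        yield f"tok{i}"

start = time.perf_counter()
ttft = None
n = 0
for token in fake_token_stream():
    if ttft is None:
        ttft = time.perf_counter() - start  # wall-clock time to first token
    n += 1
end = time.perf_counter()
tpot = (end - start - ttft) / (n - 1)       # avg time per token after the first
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms")
```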

Would you like an example of manual TTFT measurement against a real streaming API?

Then can I get the avg prompt/generation throughput like the one in the log?

Please give me step-by-step code. @RunLLM

Yes, you can get average prompt/generation throughput similar to the log output by timing your batch and counting tokens. Here’s a step-by-step code example using the vLLM Python API:

import time
from vllm import LLM, SamplingParams

# 1. Prepare prompts and sampling parameters
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# 2. Initialize the LLM
llm = LLM(model="facebook/opt-125m")

# 3. Start timing
start_time = time.time()

# 4. Generate outputs
outputs = llm.generate(prompts, sampling_params)

# 5. End timing
end_time = time.time()
elapsed = end_time - start_time

# 6. Count prompt and generation tokens
total_prompt_tokens = sum(len(output.prompt_token_ids) for output in outputs)
total_generation_tokens = sum(len(output.outputs[0].token_ids) for output in outputs)

# 7. Compute throughput
prompt_throughput = total_prompt_tokens / elapsed
generation_throughput = total_generation_tokens / elapsed

print(f"Avg prompt throughput: {prompt_throughput:.2f} tokens/s")
print(f"Avg generation throughput: {generation_throughput:.2f} tokens/s")

This approach matches the logic in the official metrics logging and issue #19680.

Would you like more detail or an example for streaming APIs?

Also, the GPU KV cache usage is 0 from the metrics.

How do I solve it? @RunLLM

If gpu kv cache usage is always 0 in the metrics, it usually means that either (1) no requests are running, (2) the cache is not being filled due to low concurrency or short prompts, or (3) logging/metrics are not enabled or not being updated. This is a common confusion, as discussed in issue #5626 and issue #9660.

How to solve it:

  • Make sure you are running enough concurrent and/or long requests to actually fill the KV cache. With only a few or short requests, usage will remain near zero.
  • Ensure you are not setting --disable-log-stats or similar flags that would suppress metric updates.
  • Check that you are reading the metric while requests are actively running; after all requests finish, usage may drop to zero as the cache is freed.
  • If using multiple GPUs, remember that more total cache is available, so the same workload will show a lower percentage (see issue #9660).
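
A back-of-envelope sketch of why a few short prompts barely register (block size 16 is the vLLM default; the total block count is an assumed example you would read from the engine startup log):

```python
import math

# vLLM allocates the KV cache in fixed-size blocks; usage % is roughly
# blocks_in_use / total_blocks.
BLOCK_SIZE = 16            # tokens per KV-cache block (vLLM default)
TOTAL_GPU_BLOCKS = 30000   # assumed capacity from engine startup logs

def est_usage(num_requests, tokens_per_request):
    """Estimated KV cache usage in percent for a uniform workload."""
    blocks_per_req = math.ceil(tokens_per_request / BLOCK_SIZE)
    return 100 * num_requests * blocks_per_req / TOTAL_GPU_BLOCKS

print(f"{est_usage(4, 40):.3f}%")     # 4 short prompts: negligible usage
print(f"{est_usage(64, 2048):.1f}%")  # many long prompts: visible usage
```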

Would you like a code example to simulate higher KV cache usage?

Then can we get streaming generation? @RunLLM

Yes, vLLM supports streaming generation. You can stream output tokens as they are generated using the AsyncLLM engine or by running the OpenAI-compatible API server and consuming the streaming endpoint. For Python, see the official async streaming example:

import asyncio
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine.async_llm import AsyncLLM

async def stream_response(engine, prompt, request_id):
    sampling_params = SamplingParams(
        max_tokens=100,
        temperature=0.8,
        top_p=0.95,
        output_kind=RequestOutputKind.DELTA,
    )
    async for output in engine.generate(
        request_id=request_id, prompt=prompt, sampling_params=sampling_params
    ):
        for completion in output.outputs:
            print(completion.text, end="", flush=True)
        if output.finished:
            print("\nGeneration complete!")
            break

async def main():
    engine_args = AsyncEngineArgs(model="meta-llama/Llama-3.2-1B-Instruct")
    engine = AsyncLLM.from_engine_args(engine_args)
    await stream_response(engine, "The future of AI is", "stream-example-1")
    engine.shutdown()

asyncio.run(main())

This code streams tokens as they are generated. For more, see async_llm_streaming.py.

Would you like more detail or an example for the OpenAI API server?

So can I directly get the GPU KV cache usage from something like the metrics? @RunLLM

Yes, you can directly get the GPU KV cache usage from the metrics returned by llm.get_metrics(). The metric name is typically vllm:kv_cache_usage_perc (or sometimes gpu_kv_cache_usage in older versions). You can filter and print it as follows:

for metric in llm.get_metrics():
    if metric.name in ("vllm:kv_cache_usage_perc", "gpu_kv_cache_usage"):
        print(f"GPU KV cache usage: {metric.value}")

This approach is shown in the official metrics example and metrics documentation.

Would you like more detail on interpreting this metric or troubleshooting if it always shows 0?
