It seems likely that the most reliable way to extract TTFT (Time To First Token), GPU KV cache usage, and related metrics is to use vLLM’s official benchmark scripts, which are designed to collect and print these statistics. Here’s a step-by-step guide:
1. Use the Official Benchmark Script
Run the benchmark script (e.g., benchmarks/benchmark_serving.py or vllm bench serve) with your desired model and parameters. For example:
vllm bench serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 128 \
--max-concurrency 8 \
--num-prompts 100
This will print a summary at the end, including TTFT, TPOT, ITL, and throughput. See benchmark example and Llama4-Scout recipe.
2. Enable and View Engine Metrics
While the benchmark runs, vLLM logs engine metrics (GPU KV cache usage, prefix cache hit rate, etc.) every few seconds, unless disable_log_stats (the --disable-log-stats flag) is set. These appear in the console as:
INFO [metrics.py] Avg prompt throughput: ... tokens/s, Avg generation throughput: ... tokens/s, Running: ... reqs, GPU KV cache usage: ...%, Prefix cache hit rate: ...%
If you want to collect these programmatically, you can scrape the /metrics Prometheus endpoint (see metrics design) or parse the log output.
3. Extract Batch Utilization and Scheduling Delay
- Batch Utilization: This is not printed directly, but you can infer it from the number of running requests and the max_num_seqs parameter in the log lines.
- Avg Scheduling Delay: Not directly exposed, but per-request queue times are available in the Prometheus metrics as vllm:request_queue_time_seconds (see metrics loggers.py).
4. Save Detailed Results
For per-request TTFT and latency, pass the --save-result flag (where supported) to the benchmark script to write a JSON file with detailed metrics for each request. You can then parse this file for TTFT and queue times.
Summary Table Example (from script output):
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 19.37
Request throughput (req/s): 51.63
Output token throughput (tok/s): 6628.49
---------------Time to First Token----------------
Mean TTFT (ms): 3819.32
Median TTFT (ms): 3076.68
P99 TTFT (ms): 12275.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 35.84
---------------Inter-token Latency----------------
Mean ITL (ms): 182.94
Would you like a code snippet for parsing the saved JSON results or for scraping the Prometheus metrics endpoint?