Benchmark for flash_attention

I haven’t seen a kernel benchmark for flash attention in the vLLM project, and the benchmark for flash attention in vllm-flash-attn doesn’t work well since it was copied directly from flash_attn. When I ran Tri Dao’s benchmark, I found that the results differ from vLLM’s, even though the operator is the same.

So I have two questions: 1. How can I print the shapes of the inputs to the attention operator? 2. How should I run a single kernel benchmark correctly?

  1. To print the shape of the input to the attention operator in vLLM, you can add print statements directly in the test files (e.g., test_flash_attn.py) before the operator call, such as print(query.shape, key_cache.shape, value_cache.shape). vLLM’s own test scripts use this approach for debugging and validation, as seen in test_flash_attn.py.
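A minimal sketch of such a debug print is below. The names query, key_cache, and value_cache (and their dimensions) are assumptions for illustration; a stand-in class is used so the snippet runs without torch installed, but any tensor-like object with a .shape attribute behaves the same way:

```python
# Sketch: log input shapes before calling an attention operator.
# FakeTensor is a stand-in for torch.Tensor (illustration only).

class FakeTensor:
    """Stand-in for a torch.Tensor, exposing only .shape."""
    def __init__(self, *dims):
        self.shape = tuple(dims)

def log_shapes(**tensors):
    """Print the shape of each named tensor-like input."""
    for name, t in tensors.items():
        print(f"{name}: shape={t.shape}")

# Hypothetical attention inputs (sizes are made up for illustration):
query = FakeTensor(8, 4096, 64, 128)       # [batch, seq_len, heads, head_dim]
key_cache = FakeTensor(1024, 16, 8, 128)   # [blocks, block_size, kv_heads, head_dim]
value_cache = FakeTensor(1024, 16, 8, 128)

log_shapes(query=query, key_cache=key_cache, value_cache=value_cache)
```

In a real test file you would keep only the log_shapes call (or a plain print) right before the operator invocation.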

  2. To run a single kernel benchmark for flash attention in vLLM, use the provided benchmark scripts, such as benchmarks/kernels/benchmark_paged_attention.py, and specify the desired parameters (e.g., --version, --batch-size, --seq-len, etc.). For example:

python benchmarks/kernels/benchmark_paged_attention.py --version v2 --batch-size 8 --seq-len 4096 --num-query-heads 64 --num-kv-heads 8 --head-size 128 --block-size 16 --dtype half

This will execute the benchmark and print the kernel running time. For more details, see benchmark_paged_attention.py.
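If the provided script rejects your parameters, a small standalone timing harness can also help compare kernels directly. Below is a generic sketch in pure Python (an assumption, not vLLM’s benchmarking code); for real CUDA kernels you would call torch.cuda.synchronize() before reading the clock, and the function under test would be the flash-attention call itself:

```python
import time

def benchmark(fn, *args, warmup=10, iters=100):
    """Time a callable: run `warmup` untimed iterations, then average
    `iters` timed ones. For CUDA kernels, add a device synchronize
    (e.g. torch.cuda.synchronize()) before each perf_counter() read,
    otherwise you mostly measure kernel launch overhead."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    end = time.perf_counter()
    return (end - start) / iters  # mean seconds per iteration

# Usage with a stand-in workload (a real run would pass the attention
# call and its prepared tensors here):
mean_s = benchmark(sum, range(10_000))
print(f"mean latency: {mean_s * 1e6:.1f} us")
```

Warmup iterations matter on GPU because the first calls pay for compilation and memory-pool allocation, which would otherwise skew the average.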

Would you like more detail on debugging or customizing these benchmarks?



You may be interested in: Add attention benchmarking tools by MatthewBonanni · Pull Request #26835 · vllm-project/vllm · GitHub


Great, I will check this.

Thanks for your reply. I tried this but got a parameter error, so maybe I should reply on that PR.