Benchmark for flash_attention

I haven’t seen a kernel benchmark for flash attention in the vLLM project, and the benchmark for flash attention in vllm-flash-attn doesn’t work well since it was copied directly from flash_attn. When I ran Tri Dao’s benchmark, I found that the results differ from vLLM’s, even though the operator is the same.

So I have two questions: 1. How can I print the shapes of the inputs to the attention operator? 2. How should I run a single kernel benchmark correctly?

  1. To print the shape of the input to the attention operator in vLLM, you can add print statements directly in the test files (e.g., test_flash_attn.py) before the operator call, such as print(query.shape, key_cache.shape, value_cache.shape). vLLM’s own test scripts use this approach for debugging and validation, as seen in test_flash_attn.py.
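A minimal sketch of such a debug print is below. The names query, key_cache, and value_cache (and their dimensions) are assumptions for illustration; a stand-in class is used so the snippet runs without torch installed, but any tensor-like object with a .shape attribute behaves the same way:

```python
# Sketch: log input shapes before calling an attention operator.
# FakeTensor is a stand-in for torch.Tensor (illustration only).

class FakeTensor:
    """Stand-in for a torch.Tensor, exposing only .shape."""
    def __init__(self, *dims):
        self.shape = tuple(dims)

def log_shapes(**tensors):
    """Print the shape of each named tensor-like input."""
    for name, t in tensors.items():
        print(f"{name}: shape={t.shape}")

# Hypothetical attention inputs (sizes are made up for illustration):
query = FakeTensor(8, 4096, 64, 128)       # [batch, seq_len, heads, head_dim]
key_cache = FakeTensor(1024, 16, 8, 128)   # [blocks, block_size, kv_heads, head_dim]
value_cache = FakeTensor(1024, 16, 8, 128)

log_shapes(query=query, key_cache=key_cache, value_cache=value_cache)
```

In a real test file you would keep only the log_shapes call (or a plain print) right before the operator invocation.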

  2. To run a single kernel benchmark for flash attention in vLLM, use the provided benchmark scripts, such as benchmarks/kernels/benchmark_paged_attention.py, and specify the desired parameters (e.g., --version, --batch-size, --seq-len, etc.). For example:

python benchmarks/kernels/benchmark_paged_attention.py --version v2 --batch-size 8 --seq-len 4096 --num-query-heads 64 --num-kv-heads 8 --head-size 128 --block-size 16 --dtype half

This will execute the benchmark and print the kernel running time. For more details, see benchmark_paged_attention.py.
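If the provided script rejects your parameters, a small standalone timing harness can also help compare kernels directly. Below is a generic sketch in pure Python (an assumption, not vLLM’s benchmarking code); for real CUDA kernels you would call torch.cuda.synchronize() before reading the clock, and the function under test would be the flash-attention call itself:

```python
import time

def benchmark(fn, *args, warmup=10, iters=100):
    """Time a callable: run `warmup` untimed iterations, then average
    `iters` timed ones. For CUDA kernels, add a device synchronize
    (e.g. torch.cuda.synchronize()) before each perf_counter() read,
    otherwise you mostly measure kernel launch overhead."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    end = time.perf_counter()
    return (end - start) / iters  # mean seconds per iteration

# Usage with a stand-in workload (a real run would pass the attention
# call and its prepared tensors here):
mean_s = benchmark(sum, range(10_000))
print(f"mean latency: {mean_s * 1e6:.1f} us")
```

Warmup iterations matter on GPU because the first calls pay for compilation and memory-pool allocation, which would otherwise skew the average.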

Would you like more detail on debugging or customizing these benchmarks?



You may be interested in: Add attention benchmarking tools by MatthewBonanni · Pull Request #26835 · vllm-project/vllm · GitHub


Great, I will check this.

Thanks for your reply. I tried this but got a parameter error, so maybe I should reply on that PR.