I haven't seen a kernel benchmark for flash attention in the vLLM project, and the flash-attn benchmark in vllm-flash-attn doesn't work well since it was copied directly from flash_attn. When I wrote a benchmark against Tri Dao's flash_attn, the results did not match vLLM's, even though the operator is the same.
So I have two questions: 1. How can we print the shape of the inputs to the attention operator? 2. How should I run a single kernel benchmark correctly?
To print the shape of the inputs to the attention operator in vLLM, you can add print statements directly in the test files (e.g., test_flash_attn.py) before the operator call, such as print(query.shape, key_cache.shape, value_cache.shape). vLLM's test scripts use this approach for debugging and validation.
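As a minimal sketch of what such a debug print looks like, the snippet below builds dummy tensors with hypothetical dimensions (the actual shapes depend on your model and test configuration; vLLM's paged KV cache layout may also differ by backend) and prints their shapes the same way you would just before the attention call:

```python
import torch

# Hypothetical dimensions for illustration only; real values come from
# the test parametrization (num_seqs, num_heads, head_size, block_size, ...).
num_seqs, num_heads, head_size = 2, 8, 128
num_blocks, block_size, num_kv_heads = 16, 16, 8

query = torch.randn(num_seqs, num_heads, head_size)
key_cache = torch.randn(num_blocks, block_size, num_kv_heads, head_size)
value_cache = torch.randn(num_blocks, block_size, num_kv_heads, head_size)

# Insert prints like these right before the attention operator call:
print("query:", tuple(query.shape))
print("key_cache:", tuple(key_cache.shape))
print("value_cache:", tuple(value_cache.shape))
```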
To run a single kernel benchmark for flash attention in vLLM, use one of the provided benchmark scripts, such as benchmarks/kernels/benchmark_paged_attention.py, and pass the desired parameters (e.g., --version, --batch-size, --seq-len). For example:
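A possible invocation, run from the vLLM repo root, using the flags listed above (flag names and accepted values may differ across vLLM versions, so check `--help` first):

```shell
# Sketch only: exact flags/values depend on your vLLM version.
python benchmarks/kernels/benchmark_paged_attention.py --help

python benchmarks/kernels/benchmark_paged_attention.py \
    --version v2 \
    --batch-size 8 \
    --seq-len 4096
```

Running `--help` first shows the full set of tunable parameters (head size, dtype, block size, etc.) so you can match the shapes you printed from the test.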