What is the flash attention version that is compatible with the latest version of vllm?
How to install it?
The latest vLLM ships with its own forked, bundled build of FlashAttention, so you do not need to install the upstream flash-attn package separately. vLLM automatically selects the appropriate FlashAttention version (v2 or v3) based on your hardware; for example, H100 GPUs default to FlashAttention v3. If you need to force a specific version, set the VLLM_FLASH_ATTN_VERSION environment variable, though this is rarely necessary. For most users on CUDA systems, installing vLLM via pip is sufficient, and FlashAttention support is included out of the box with no extra installation. ROCm (AMD) users should follow the ROCm-specific build instructions, which may involve building flash-attention from source as described in the docs.
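If you do need to pin the FlashAttention version, a minimal sketch (the model name below is only a placeholder; VLLM_FLASH_ATTN_VERSION is read when the server starts):

# Force FlashAttention v2 instead of the auto-selected version
VLLM_FLASH_ATTN_VERSION=2 vllm serve meta-llama/Llama-3.1-8B-Instruct

Unsetting the variable restores the automatic hardware-based selection.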
For installation, use:
pip install vllm
or, for ROCm:
# See ROCm instructions for building flash-attention and vLLM from source
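After installing, a quick sanity check (a minimal sketch; the attention backend actually selected is typically reported in the engine startup logs):

# Confirm vLLM imports and print the installed version
python -c "import vllm; print(vllm.__version__)"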
According to the documentation, do not install the upstream flash-attn package unless specifically instructed, as vLLM manages its own compatible version internally. For more details, see the official installation and backend docs.
Would you like more detail on ROCm or advanced configuration?