What is the flash attention version that is compatible with the latest version of vllm?
How to install it?
The latest vLLM ships with its own forked, bundled build of FlashAttention, so you do not need to install the upstream flash-attn package separately. vLLM automatically selects the appropriate FlashAttention version (v2 or v3) based on your hardware; for example, H100 GPUs default to FlashAttention v3. If you need to force a specific version, set the VLLM_FLASH_ATTN_VERSION environment variable, though this is rarely necessary. For most users on CUDA systems, installing vLLM via pip is sufficient, and FlashAttention support is included out of the box with no extra installation. ROCm (AMD) users should follow the ROCm-specific build instructions, which may involve building flash-attention from source as described in the docs.
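If you do need to pin the FlashAttention version, a minimal sketch (the model name below is only a placeholder; VLLM_FLASH_ATTN_VERSION is read when the server starts):

# Force FlashAttention v2 instead of the auto-selected version
VLLM_FLASH_ATTN_VERSION=2 vllm serve meta-llama/Llama-3.1-8B-Instruct

Unsetting the variable restores the automatic hardware-based selection.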
For installation, use:
pip install vllm
or, for ROCm:
# See ROCm instructions for building flash-attention and vLLM from source
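After installing, a quick sanity check (a minimal sketch; the attention backend actually selected is typically reported in the engine startup logs):

# Confirm vLLM imports and print the installed version
python -c "import vllm; print(vllm.__version__)"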
According to the documentation, do not install the upstream flash-attn package unless specifically instructed, as vLLM manages its own compatible version internally. For more details, see the official installation and backend docs.
Would you like more detail on ROCm or advanced configuration?