How do I enable FA4 on B200?

I exported the following environment variables on 8xB200:

VLLM_ATTENTION_BACKEND=FLASH_ATTN
VLLM_FLASH_ATTN_VERSION=4

But vLLM reported an error like this:

(Worker_TP6 pid=10328) ERROR 12-17 23:53:18 [multiproc_executor.py:822]   File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 308, in flash_attn_varlen_func
(Worker_TP6 pid=10328) ERROR 12-17 23:53:18 [multiproc_executor.py:822]     raise ValueError(f"Unsupported FA version: {fa_version}")
(Worker_TP6 pid=10328) ERROR 12-17 23:53:18 [multiproc_executor.py:822] ValueError: Unsupported FA version: None

Does vLLM support FA4 on sm100, like SGLang's --prefill-attention-backend fa4?

My vLLM version is 0.12.0, CUDA is 13.0, and the NVIDIA driver is 580.82.07.

vLLM does not support FlashAttention 4 (FA4) as an attention backend; only FA2 and FA3 are supported, and requesting FA4 results in an "Unsupported FA version" error. The environment variable VLLM_FLASH_ATTN_VERSION accepts only the values 2 or 3, and as of version 0.12.0 there is no FA4 support for sm100 (B200) in the vLLM code or documentation. The error you encountered is the expected result of requesting an unsupported FA version.

If you need FA4-specific features, you must use a backend or framework that explicitly supports FA4, such as SGLang. For vLLM, set VLLM_FLASH_ATTN_VERSION=3 on Hopper (sm90) or compatible architectures, or use the recommended backend for your hardware. Would you like more detail on configuring vLLM attention backends?
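For reference, a minimal sketch of the working configuration described above: the environment variables are set before vLLM initializes, and FA3 is pinned for sm90-class hardware. The model name here is a placeholder, not something from this thread.

```python
# Pin the FlashAttention backend and version before vLLM starts.
# These must be set before vLLM's backend selection runs.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"  # only "2" or "3" are accepted

# With the environment prepared, launch as usual (placeholder model name):
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=8)
```

Equivalently, you can export the same two variables in the shell before running `vllm serve`.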

Hint: Mention @RunLLM in the post for followups.

Does vLLM have plans to support FA4 on sm100? If so, roughly when would that be? @RunLLM

It seems likely that vLLM is working on integrating FlashAttention 4 (FA4): there is an open pull request for FA4 integration, but as of now FA4 is not supported, and the code path is unreachable because the version-selection logic only allows FA2 or FA3. There is no official timeline or release date for FA4 support on sm100 (B200) in vLLM, and enabling FA4 currently triggers an error.

Would you like more detail on the current development status or how to track future updates?
