Building vLLM with PyTorch 2.9

FA3 seems to be broken on CUDA 13.0 with GH200. Can I build vLLM without FlashAttention if I'm using FlashInfer instead?

Yes, you can build vLLM without FlashAttention if you use FlashInfer as your attention backend. vLLM supports multiple attention backends, including FLASHINFER, which you can select by setting the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER. However, the pre-built vLLM wheels do not bundle FlashInfer, so you must install FlashInfer in your environment yourself before building or running vLLM. Refer to the FlashInfer official docs or the vLLM Dockerfile for installation instructions.
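For context, here is a minimal sketch of what selecting the FlashInfer backend looks like from Python. This assumes FlashInfer is already installed for your CUDA/PyTorch build; the model name is just a placeholder:

```python
# Sketch only: select the FlashInfer attention backend before vLLM initializes.
# Assumes the flashinfer package is already installed to match your CUDA/PyTorch
# build; the model name below is a placeholder.
import os

# Must be set before vLLM chooses an attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(max_tokens=32)
print(llm.generate(["Hello from GH200!"], params)[0].outputs[0].text)
```

Exporting the variable in your shell before launching `vllm serve` should work the same way, since the backend is picked up from the environment at startup.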

To summarize: FlashAttention is not required if you use FlashInfer, but you must ensure FlashInfer is installed and selected as the backend. For more, see vLLM Quickstart: On Attention Backends.

Would you like more detail on the build process or environment setup?
