Building vLLM with PyTorch 2.9

FA3 seems to be broken on CUDA 13.0 with GH200. Can I build vLLM without FlashAttention if I'm using FlashInfer instead?

Yes, you can build vLLM without FlashAttention if you use FlashInfer as your attention backend. vLLM supports multiple attention backends, including FLASHINFER, which you can select by setting the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER. However, the pre-built vLLM wheels do not bundle FlashInfer, so you must install FlashInfer in your environment yourself before building or running vLLM. Refer to the FlashInfer official docs or the vLLM Dockerfile for installation instructions.
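For context, here is a minimal sketch of what selecting the FlashInfer backend looks like from Python. This assumes FlashInfer is already installed for your CUDA/PyTorch build; the model name is just a placeholder:

```python
# Sketch only: select the FlashInfer attention backend before vLLM initializes.
# Assumes the flashinfer package is already installed to match your CUDA/PyTorch
# build; the model name below is a placeholder.
import os

# Must be set before vLLM chooses an attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(max_tokens=32)
print(llm.generate(["Hello from GH200!"], params)[0].outputs[0].text)
```

Exporting the variable in your shell before launching `vllm serve` should work the same way, since the backend is picked up from the environment at startup.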

To summarize: FlashAttention is not required if you use FlashInfer, but you must ensure FlashInfer is installed and selected as the backend. For more, see vLLM Quickstart: On Attention Backends.

Would you like more detail on the build process or environment setup?
