FlashInfer and FlashAttention are both optimized attention backends in vLLM. FlashAttention is the default for most models on NVIDIA GPUs. FlashInfer targets fast decoding and supports FP8 quantization, but it does not outperform FlashAttention in every scenario. Both backends handle grouped-query attention (GQA), so GQA alone is not a reason to switch. Only one backend is active at a time; select it with the VLLM_ATTENTION_BACKEND environment variable (for example FLASH_ATTN or FLASHINFER). Recent benchmarks show similar performance between the two on A100/H100, with FlashInfer sometimes slightly faster for large batch sizes or specific quantized models (benchmark, discussion).
If you use FP8 quantization, FlashInfer is recommended; otherwise, FlashAttention is a strong default. To compare them on your workload, run the same benchmark once per backend, changing the environment variable between runs.
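A minimal sketch of that switch-and-rerun approach, assuming the backend is read from VLLM_ATTENTION_BACKEND at process start and that FLASH_ATTN and FLASHINFER are the accepted names in your vLLM version. The helper names `backend_env` and `run_with_backend` are hypothetical, and the probe command only echoes the variable; replace it with your own benchmark script.

```python
import os
import subprocess
import sys

# Backend names assumed to be accepted by VLLM_ATTENTION_BACKEND;
# verify them against the vLLM version you have installed.
BACKENDS = ("FLASH_ATTN", "FLASHINFER")


def backend_env(backend):
    """Return a copy of the current environment with the backend selected."""
    env = dict(os.environ)
    env["VLLM_ATTENTION_BACKEND"] = backend
    return env


def run_with_backend(backend, command):
    """Run `command` in a fresh process so the variable is set before startup."""
    result = subprocess.run(
        command,
        env=backend_env(backend),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder command (hypothetical): it only echoes the variable.
    # Swap in your benchmark, e.g. [sys.executable, "my_benchmark.py"].
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ['VLLM_ATTENTION_BACKEND'])",
    ]
    for backend in BACKENDS:
        print(backend, "->", run_with_backend(backend, probe).strip())
```

Each backend runs in its own process, so results from one run cannot leak into the other. Keep the model, batch size, and prompt set identical across runs so the comparison is fair.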
Would you like more detail on configuration or performance benchmarks?
Sources: