Speeding up vllm inference for Qwen2.5-VL

RunLLM · May 20, 2025, 8:36am

FlashInfer and FlashAttention are both optimized attention backends in vLLM. FlashAttention is the default for most models and is highly optimized for NVIDIA GPUs. FlashInfer targets faster decoding and supports FP8 quantization and GQA, but it does not outperform FlashAttention in every scenario. Only one backend can be active at a time; select it with the VLLM_ATTENTION_BACKEND environment variable. Recent benchmarks show similar performance between the two on A100/H100, with FlashInfer sometimes slightly faster for large batch sizes or specific quantized models, but not universally (benchmark, discussion).

If you use FP8 quantization or need GQA support, FlashInfer is recommended; otherwise, FlashAttention is a strong default. To find out which is faster for your workload, benchmark both by switching the backend between runs.
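Switching the backend between runs could look roughly like the sketch below. It assumes that `FLASH_ATTN` and `FLASHINFER` are the values your vLLM version accepts for VLLM_ATTENTION_BACKEND, and that you already have a benchmark command of your own. The helper names `env_for_backend` and `time_command` are made up for illustration and are not part of vLLM. Each run starts a fresh process so the variable is in place before vLLM initializes.

```python
import os
import subprocess
import sys
import time

# Assumed values for VLLM_ATTENTION_BACKEND; check the names your vLLM version accepts.
BACKENDS = ("FLASH_ATTN", "FLASHINFER")


def env_for_backend(backend, base_env=None):
    """Return a copy of the environment with the attention backend selected."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_ATTENTION_BACKEND"] = backend
    return env


def time_command(cmd, backend):
    """Run cmd in a fresh process with the given backend; return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env_for_backend(backend), check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # The benchmark command comes from the command line; nothing runs without one.
    cmd = sys.argv[1:]
    if cmd:
        for backend in BACKENDS:
            print(f"{backend}: {time_command(cmd, backend):.1f}s")
```

Saved as, say, compare_backends.py (a hypothetical file name), it would be invoked as `python compare_backends.py python my_benchmark.py`, where my_benchmark.py stands in for whatever script sends your own Qwen2.5-VL requests. Wall-clock time includes model loading, so for a fair comparison have the benchmark script report its own throughput after warm-up.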

Would you like more detail on configuration or performance benchmarks?
