vLLM supports the FlashInfer attention backend on both the V0 and V1 engines. You can select it by setting the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER. However, some features and performance optimizations (such as CUDA Graph support) are more mature on V1, and FlashInfer is not enabled by default because its performance is still being tuned. FlashInfer is not bundled with vLLM, so you must install it separately. For details, see the docs and discussion.
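As a minimal sketch of how the selection works, assuming FlashInfer is already installed for your CUDA/PyTorch combination (the package name and wheel index vary by version, so check the FlashInfer install docs) and using a placeholder model name:

```python
import os

# Select the FlashInfer attention backend; this must be set before vLLM
# creates the engine, since the backend is chosen at initialization time.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Example model; substitute any model supported by vLLM.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is FlashInfer?"], params)
print(outputs[0].outputs[0].text)
```

Equivalently, you can export the same environment variable in the shell before launching the OpenAI-compatible server.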
Would you like more detail on the differences between V0 and V1, or on how to configure FlashInfer?
I’ve noticed that on H100/H200, V0 with multi-step is still a few % faster than V1, but only with a single concurrent request. With more than one request in flight, the gap closes and V1 is slightly faster. I don’t think there are many easy fixes that would close this gap for V1.
Your observation is correct: with a single concurrent request, V0 with multi-step can be a few percent faster than V1, especially on H100/H200, while at higher concurrency V1 matches or slightly surpasses V0. This is known and expected behavior rooted in architectural differences: V1 is optimized for high concurrency and scheduling efficiency, whereas V0’s multi-step mode minimizes per-step CPU overhead for a single request. There are no simple fixes that would make V1 consistently faster in single-request, multi-step scenarios, as discussed in this issue and related discussions.
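If you want to reproduce the single-request comparison yourself, here is a rough sketch. It assumes a vLLM version where the VLLM_USE_V1 environment variable and the num_scheduler_steps engine argument are still available (both are version-dependent), and the model and prompt are placeholders:

```python
import os
import time

# Force the V0 engine; combined with num_scheduler_steps below this enables
# multi-step scheduling. Both knobs are version-dependent and may be removed
# as V1 becomes the default.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    num_scheduler_steps=8,  # run several decode steps per scheduler invocation
)

params = SamplingParams(temperature=0.0, max_tokens=256)
prompt = "Explain the difference between prefill and decode in one paragraph."

# Single concurrent request: measure end-to-end latency.
start = time.perf_counter()
llm.generate([prompt], params)
elapsed = time.perf_counter() - start
print(f"single-request latency: {elapsed:.2f}s")
```

Running the same script with VLLM_USE_V1=1 (and without num_scheduler_steps) gives the V1 number; repeating it with a batch of prompts should show the gap closing as concurrency increases.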
Would you like a deeper technical breakdown of why this performance difference exists?
With FlashAttention 2 (flash_attn2), increasing the batch size generally improves GPU utilization and throughput, especially for small models or short contexts. Moving from batch size 32 to 64 to 128, you should see higher throughput and better FLOPs utilization, though the gains may plateau once memory bandwidth or scheduling overhead becomes the bottleneck. This is consistent with vLLM’s design and is discussed in this issue and the optimization docs.
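To quantify this on your own hardware, a rough throughput sweep over those batch sizes with the offline LLM API might look like the sketch below. The model name and prompt are placeholders, and the FlashAttention backend is requested via VLLM_ATTENTION_BACKEND (the accepted backend strings can vary by vLLM version):

```python
import os
import time

# Request the FlashAttention backend (value may vary across vLLM versions).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
params = SamplingParams(temperature=0.0, max_tokens=128)

for batch_size in (32, 64, 128):
    prompts = ["Summarize the benefits of batching in one sentence."] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    # Count generated tokens across the batch to compute decode throughput.
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size:4d}  {generated / elapsed:8.1f} tok/s")
```

Throughput should rise with batch size and then flatten once the GPU is saturated; comparing the tok/s figures across the three runs makes the plateau easy to see.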
Would you like more detail on how to further tune batch size or interpret your results?