vLLM supports the FlashInfer attention backend on both the V0 and V1 engines. You can select it by setting the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER. However, some features and performance optimizations (such as CUDA Graph support) are more mature on V1, and FlashInfer is not enabled by default because its performance is still being tuned. FlashInfer is not bundled with vLLM, so you must install it separately. For details, see the docs and discussion.
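As a minimal sketch of how the selection works, assuming FlashInfer is already installed for your CUDA/PyTorch combination (the package name and wheel index vary by version, so check the FlashInfer install docs) and using a placeholder model name:

```python
import os

# Select the FlashInfer attention backend; this must be set before vLLM
# creates the engine, since the backend is chosen at initialization time.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Example model; substitute any model supported by vLLM.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is FlashInfer?"], params)
print(outputs[0].outputs[0].text)
```

Equivalently, you can export the same environment variable in the shell before launching the OpenAI-compatible server.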
Would you like more detail on the differences between V0 and V1, or on how to configure FlashInfer?
I’ve noticed that on H100/H200, V0 with multi-step is still a few % faster than V1, but only with a single concurrent request. With more than one request in flight, the gap closes and V1 is slightly faster. I don’t think there are many easy fixes that would close this gap for V1.
Your observation is correct: with a single concurrent request, V0 with multi-step can be a few percent faster than V1, especially on H100/H200, while at higher concurrency V1 matches or slightly surpasses V0. This is known and expected behavior rooted in architectural differences: V1 is optimized for high concurrency and scheduling efficiency, whereas V0’s multi-step mode minimizes per-step CPU overhead for a single request. There are no simple fixes that would make V1 consistently faster in single-request, multi-step scenarios, as discussed in this issue and related discussions.
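If you want to reproduce the single-request comparison yourself, here is a rough sketch. It assumes a vLLM version where the VLLM_USE_V1 environment variable and the num_scheduler_steps engine argument are still available (both are version-dependent), and the model and prompt are placeholders:

```python
import os
import time

# Force the V0 engine; combined with num_scheduler_steps below this enables
# multi-step scheduling. Both knobs are version-dependent and may be removed
# as V1 becomes the default.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    num_scheduler_steps=8,  # run several decode steps per scheduler invocation
)

params = SamplingParams(temperature=0.0, max_tokens=256)
prompt = "Explain the difference between prefill and decode in one paragraph."

# Single concurrent request: measure end-to-end latency.
start = time.perf_counter()
llm.generate([prompt], params)
elapsed = time.perf_counter() - start
print(f"single-request latency: {elapsed:.2f}s")
```

Running the same script with VLLM_USE_V1=1 (and without num_scheduler_steps) gives the V1 number; repeating it with a batch of prompts should show the gap closing as concurrency increases.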
Would you like a deeper technical breakdown of why this performance difference exists?
With FlashAttention 2 (flash_attn2), increasing the batch size generally improves GPU utilization and throughput, especially for small models or short contexts. Moving from batch size 32 to 64 to 128, you should see higher throughput and better FLOPs utilization, though the gains may plateau once memory bandwidth or scheduling overhead becomes the bottleneck. This is consistent with vLLM’s design and is discussed in this issue and the optimization docs.
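To quantify this on your own hardware, a rough throughput sweep over those batch sizes with the offline LLM API might look like the sketch below. The model name and prompt are placeholders, and the FlashAttention backend is requested via VLLM_ATTENTION_BACKEND (the accepted backend strings can vary by vLLM version):

```python
import os
import time

# Request the FlashAttention backend (value may vary across vLLM versions).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
params = SamplingParams(temperature=0.0, max_tokens=128)

for batch_size in (32, 64, 128):
    prompts = ["Summarize the benefits of batching in one sentence."] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    # Count generated tokens across the batch to compute decode throughput.
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size:4d}  {generated / elapsed:8.1f} tok/s")
```

Throughput should rise with batch size and then flatten once the GPU is saturated; comparing the tok/s figures across the three runs makes the plateau easy to see.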
Would you like more detail on how to further tune batch size or interpret your results?