V1 has lower end-to-end performance than V0 (--num-scheduler-steps=8)

Testing Environment:

  • vLLM version: 0.8.5
  • Operating System: Ubuntu 22.04
  • GPU: NVIDIA A100

Performance Comparison:

  • For Llama and Qwen2 series models, the end-to-end inference performance of V1 is 5% to 10% lower than that of V0.
  • V0 configuration: --num-scheduler-steps=8 (a minimal reproduction sketch follows this list).
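
A minimal offline sketch of this comparison is below, under stated assumptions: the model name, prompt, and batch size are illustrative placeholders rather than the original test configuration, and VLLM_USE_V1=0 is used to force the V0 engine, since vLLM 0.8.5 selects V1 by default where supported. Run the script once per engine and compare the reported throughput:

```python
import os
import time

from vllm import LLM, SamplingParams

# Engine selection: run with VLLM_USE_V1=0 to force the V0 engine
# (vLLM 0.8.5 picks V1 by default where supported), e.g.:
#   VLLM_USE_V1=0 python bench_offline.py   # V0 + multi-step scheduling
#   VLLM_USE_V1=1 python bench_offline.py   # V1
use_v1 = os.environ.get("VLLM_USE_V1", "1") == "1"

# num_scheduler_steps (multi-step scheduling) applies only to the V0 engine.
engine_kwargs = {} if use_v1 else {"num_scheduler_steps": 8}
llm = LLM(model="Qwen/Qwen2-7B-Instruct", **engine_kwargs)  # illustrative model

prompts = ["Summarize the benefits of multi-step scheduling."] * 64
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"engine={'V1' if use_v1 else 'V0'}: {tokens / elapsed:.1f} output tokens/s")
```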

Question:

Has the community published any relevant performance comparison data?

Yes, there are multiple community reports comparing vLLM V1 and V0 performance. Several users have observed that, for Llama and Qwen2 series models, V1 can be 5–10% slower than V0 in end-to-end inference, particularly when V0 is run with a comparable configuration (e.g., --num-scheduler-steps=8). These findings are documented in benchmark discussions and issue threads, where users share their test results and configurations for reproducibility. For example, see the performance regression reports and community benchmarks in this GitHub issue and this Slack thread.
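
For the serving path, a rough latency probe against an already-running OpenAI-compatible server is sketched below. It assumes a server on localhost:8000 started with, e.g., `vllm serve Qwen/Qwen2-7B-Instruct --num-scheduler-steps 8` with VLLM_USE_V1=0 for V0, or the same command under VLLM_USE_V1=1 for V1; the endpoint, model name, and prompt are placeholder assumptions, not the configurations from the reports above:

```python
import time

import requests  # pip install requests

# Assumed local endpoint of a vLLM OpenAI-compatible server; adjust as needed.
URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "Qwen/Qwen2-7B-Instruct",  # placeholder model name
    "prompt": "Write a short note on multi-step scheduling.",
    "max_tokens": 128,
    "temperature": 0.0,
}

# Time a handful of sequential requests; run once per engine and compare.
latencies = []
for _ in range(10):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=120)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"mean end-to-end latency: {sum(latencies) / len(latencies):.3f}s")
```

Repeating the probe on identical hardware for both engines gives a first-order check of whether the reported 5–10% gap reproduces in your setup.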

Would you like more detail on the specific benchmark setups or configuration differences?
