What is the Perf Difference between the V0 Engine (--num-scheduler-steps) and the V1 Engine?

Has anyone compared the perf of V0 with --num-scheduler-steps against V1? Should we expect V1 to be better than V0 with --num-scheduler-steps?

Yes, several users have compared vLLM V1 to V0 with --num-scheduler-steps (multi-step scheduling). The reported pattern is that V0 with multi-step scheduling can have lower latency for a single request, while V1 generally does better on throughput and efficiency under high concurrency, especially with long prompts or many parallel requests. The exact difference depends on the workload and environment, so it is worth measuring on your own setup. See detailed benchmarks and discussion in this issue and this issue.
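If you want to reproduce this kind of comparison yourself, the key is to measure single-request latency and concurrent throughput separately, since the two engines reportedly trade places between those cases. Below is a minimal sketch of such a harness. It is not part of vLLM: `measure`, `send_request`, and `fake_request` are hypothetical names, and `fake_request` is only a stand-in that you would replace with a real call to whichever server (V0 with --num-scheduler-steps, or V1) you are testing.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def measure(send_request, num_requests, concurrency):
    """Run num_requests calls of send_request at the given concurrency.

    send_request is a hypothetical zero-argument callable that performs one
    generation request against the server under test and returns the number
    of generated tokens.
    """
    def timed_call(_):
        start = time.perf_counter()
        tokens = send_request()
        return time.perf_counter() - start, tokens

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(num_requests)))
    wall = time.perf_counter() - wall_start

    latencies = [latency for latency, _ in results]
    total_tokens = sum(tokens for _, tokens in results)
    return {
        "mean_latency_s": mean(latencies),   # per-request latency
        "tokens_per_s": total_tokens / wall,  # aggregate throughput
    }


def fake_request():
    # Assumption: stand-in for a real HTTP call to the server under test.
    # It sleeps briefly and reports 10 generated tokens.
    time.sleep(0.01)
    return 10


if __name__ == "__main__":
    # Single-request case vs. concurrent case, run against the same target.
    print("single    :", measure(fake_request, num_requests=1, concurrency=1))
    print("concurrent:", measure(fake_request, num_requests=16, concurrency=8))
```

Run the same script once against each engine configuration with identical prompts and output lengths, then compare the two dictionaries. A thread pool is enough here because the client mostly waits on the server.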

Would you like a step-by-step breakdown of the benchmark results or more details on specific scenarios?

Sources: