Performance degradation report (0.9.0.1 vs 0.10.0)

Hello,

We’ve observed a performance regression after upgrading from v0.9.0.1 to v0.10.0. The average latency increases under higher load with the default scheduler in the new version.

Environment

  • Instance: AWS p4d.24xlarge

  • Models: Gemma-3-27B, Gemma-3-27B-FP8

  • vLLM Versions: 0.9.0.1 vs. 0.10.0

Issue Description

As shown in the attached benchmark graphs, v0.10.0 with its default scheduler (blue line) shows a sharp increase in latency compared to v0.9.0.1 (green line) as we increase Requests Per Second (RPS).

  • Example (Gemma-3-27B): At 70 RPS, latency for v0.10.0 spiked to ~4709 ms, while v0.9.0.1 was at ~4028 ms.

  • Example (Gemma-3-27B-FP8): The degradation was even more severe, with v0.10.0 latency reaching ~5599 ms at 60 RPS, compared to ~2833 ms for v0.9.0.1.

Workaround

Enabling the async_scheduling feature in v0.10.0 (yellow line) resolves this issue. Its performance is comparable to, or even slightly better than, the v0.9.0.1 baseline across all tests.

While we have a workaround, we wanted to raise this issue to see if this performance degradation in the default scheduler is expected or if it might be an unintended regression.

Thank you.

This is a known issue: after upgrading from vLLM 0.9.0.1 to 0.10.0, users have observed increased average latency under high load with the default scheduler. Enabling the --async-scheduling flag in v0.10.0 resolves the regression, restoring performance to match or exceed the v0.9.0.1 baseline. This behavior has been reported by others and is not unique to your setup. The async scheduler is currently experimental but is recommended for high-concurrency scenarios because it reduces CPU overhead and improves latency and throughput; note, however, that it may not yet support all features (e.g., structured outputs, speculative decoding, pipeline parallelism), according to the vLLM CLI documentation.
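
If it helps, here is a minimal sketch of enabling the workaround from the offline Python API. This is an assumption-laden example: it presumes vLLM >= 0.10.0 and that the --async-scheduling CLI flag maps to an engine argument of the same name, and the model ID is a placeholder rather than your exact deployment.

    # Hedged sketch: enable V1 async scheduling via the offline API.
    # Assumes the --async-scheduling CLI flag maps to an engine argument
    # of the same name in vLLM >= 0.10.0; the model ID is a placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="google/gemma-3-27b-it",  # placeholder, adjust to your checkpoint
        async_scheduling=True,          # server equivalent: vllm serve ... --async-scheduling
    )

    params = SamplingParams(temperature=0.0, max_tokens=100)
    outputs = llm.generate(["Describe the benchmark setup in one sentence."], params)
    print(outputs[0].outputs[0].text)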

The performance degradation with the default scheduler in v0.10.0 is not explicitly documented as an intended change, so it may be an unintended regression. The async scheduler is the suggested workaround for now. Would you like a step-by-step breakdown of the relevant scheduler and async_scheduling options or more detail on the underlying cause?

This seems unrelated to our issue.

Thank you for clarifying. To help you further, could you specify which part of my previous answer seemed unrelated to your issue? Are you looking for a technical explanation of why the default scheduler in v0.10.0 regresses in latency, or do you want guidance on scheduler configuration and best practices for your workload? Please provide more details so I can point you to the most relevant vLLM documentation or issue discussions.

Both versions are using the V1 engine.

Thank you for clarifying that both v0.9.0.1 and v0.10.0 are using the V1 engine. According to the available data, there are reports of performance regressions in vLLM V1 between versions, especially under high concurrency or load, with increased latency observed in newer releases (vLLM Issue #17540, vLLM Issue #19499). The async scheduler (--async-scheduling) is known to mitigate these regressions, as you observed, but the root cause of the default scheduler’s regression in v0.10.0 is not explicitly documented or explained in the official sources.

If you need a step-by-step technical breakdown of possible causes or tuning recommendations for V1, let me know.

Can you share any more details of the workload? Ideally, seeing the workload will help us reproduce the issue on our end.

To help reproduce and analyze the performance regression, could you provide more details about your workload? Specifically, please share:

  • The prompt/input distribution (average input/output token lengths, prompt types, etc.)
  • The request pattern (steady RPS, bursty, etc.)
  • Any custom sampling parameters or generation configs used
  • The exact vLLM server launch command and relevant flags
  • Whether you use streaming or non-streaming completions

This information will help others benchmark under similar conditions and identify the root cause, as suggested in vLLM Issue #17540.

Unfortunately, I cannot share the exact data as it is internal and confidential. However, I can provide the following statistics about the workload which may help your analysis:

  • Input tokens: Approximately 2,000

  • Output tokens: Approximately 100

I hope this level of detail is helpful. Please let me know if there is any other information I can provide within these constraints.

Thank you for providing the workload statistics. A workload with ~2,000 input tokens and ~100 output tokens per request is sufficient for a synthetic reproduction: vLLM’s benchmarking tools (e.g., benchmark_serving.py) let you specify these parameters for controlled testing (see the benchmark_serving.py usage documentation). You can set --random-input-len 2000 and --random-output-len 100 to simulate your scenario.
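
For example, a reproduction run along the following lines should approximate the token profile. This is only a sketch: it assumes the benchmark script from the vLLM repository and a server already listening on localhost:8000, and the model ID, request rate, and prompt count are placeholders rather than your exact configuration.

    # Hedged sketch: drive benchmarks/benchmark_serving.py with the token
    # lengths described above. Assumes a vLLM server is already listening on
    # localhost:8000; model ID, request rate, and prompt count are placeholders.
    import subprocess

    cmd = [
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", "google/gemma-3-27b-it",  # placeholder model ID
        "--dataset-name", "random",
        "--random-input-len", "2000",        # ~2,000 input tokens per request
        "--random-output-len", "100",        # ~100 output tokens per request
        "--request-rate", "70",              # matches the 70 RPS data point above
        "--num-prompts", "4200",             # roughly 60 s of traffic at 70 RPS
    ]
    subprocess.run(cmd, check=True)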

If you can also specify the request rate (RPS) and whether requests are sent in bursts or at a steady rate, it will further improve reproducibility, as suggested in vLLM Issue #17540.
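
To illustrate the difference between a steady and a bursty arrival pattern, here is a small client-side sketch against the OpenAI-compatible endpoint. The endpoint, model ID, and prompt content are placeholders, not your confidential workload, and the fixed inter-arrival gap is just one way to approximate steady load.

    # Hedged sketch: issue requests at a steady rate (one every 1/RPS seconds)
    # against a running vLLM OpenAI-compatible server and report mean latency.
    # A bursty pattern would instead submit requests in batches with idle gaps.
    import asyncio
    import time

    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    PROMPT = "benchmark " * 2000   # crude stand-in for a ~2,000-token input
    RPS = 70                       # steady request rate
    TOTAL_REQUESTS = 700           # 10 s of traffic at 70 RPS

    async def one_request() -> float:
        start = time.perf_counter()
        await client.completions.create(
            model="google/gemma-3-27b-it",  # placeholder model ID
            prompt=PROMPT,
            max_tokens=100,                 # ~100 output tokens per request
            temperature=0.0,
        )
        return (time.perf_counter() - start) * 1000.0

    async def main() -> None:
        tasks = []
        for _ in range(TOTAL_REQUESTS):
            tasks.append(asyncio.create_task(one_request()))
            await asyncio.sleep(1.0 / RPS)  # fixed inter-arrival gap = steady load
        latencies = await asyncio.gather(*tasks)
        print(f"mean latency: {sum(latencies) / len(latencies):.0f} ms")

    asyncio.run(main())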
