Performance degradation report (0.9.0.1 vs 0.10.0)

Hello,

We’ve observed a performance regression after upgrading from v0.9.0.1 to v0.10.0. The average latency increases under higher load with the default scheduler in the new version.

Environment

  • Instance: AWS p4d.24xlarge

  • Models: Gemma-3-27B, Gemma-3-27B-FP8

  • vLLM Versions: 0.9.0.1 vs. 0.10.0

Issue Description

As shown in the attached benchmark graphs, v0.10.0 with its default scheduler (blue line) shows a sharp increase in latency compared to v0.9.0.1 (green line) as we increase Requests Per Second (RPS).

  • Example (Gemma-3-27B): At 70 RPS, latency for v0.10.0 spiked to ~4709 ms, while v0.9.0.1 was at ~4028 ms.

  • Example (Gemma-3-27B-FP8): The degradation was even more severe, with v0.10.0 latency reaching ~5599 ms at 60 RPS, compared to ~2833 ms for v0.9.0.1.

Workaround

Enabling the async_scheduling feature in v0.10.0 (yellow line) resolves this issue. Its performance is comparable to, or even slightly better than, the v0.9.0.1 baseline across all tests.

While we have a workaround, we wanted to raise this issue to see if this performance degradation in the default scheduler is expected or if it might be an unintended regression.

Thank you.

This is a known issue: after upgrading from vLLM 0.9.0.1 to 0.10.0, users have observed increased average latency under high load with the default scheduler. Enabling the --async-scheduling flag in v0.10.0 resolves the regression, restoring performance to match or exceed the v0.9.0.1 baseline. This behavior has been reported by others and is not unique to your setup. The async scheduler is currently experimental but is recommended for high-concurrency scenarios because it reduces CPU overhead and improves latency and throughput; note, however, that it may not yet support all features (e.g., structured outputs, speculative decoding, pipeline parallelism), according to the vLLM CLI documentation.
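
If it helps, here is a minimal sketch of enabling the workaround from the offline Python API. This is an assumption-laden example: it presumes vLLM >= 0.10.0 and that the --async-scheduling CLI flag maps to an engine argument of the same name, and the model ID is a placeholder rather than your exact deployment.

    # Hedged sketch: enable V1 async scheduling via the offline API.
    # Assumes the --async-scheduling CLI flag maps to an engine argument
    # of the same name in vLLM >= 0.10.0; the model ID is a placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="google/gemma-3-27b-it",  # placeholder, adjust to your checkpoint
        async_scheduling=True,          # server equivalent: vllm serve ... --async-scheduling
    )

    params = SamplingParams(temperature=0.0, max_tokens=100)
    outputs = llm.generate(["Describe the benchmark setup in one sentence."], params)
    print(outputs[0].outputs[0].text)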

The performance degradation with the default scheduler in v0.10.0 is not explicitly documented as an intended change, so it may be an unintended regression. The async scheduler is the suggested workaround for now. Would you like a step-by-step breakdown of the relevant scheduler and async_scheduling options or more detail on the underlying cause?

This seems unrelated to our issue.

Thank you for clarifying. To help you further, could you specify which part of my previous answer seemed unrelated to your issue? Are you looking for a technical explanation of why the default scheduler in v0.10.0 regresses in latency, or do you want guidance on scheduler configuration and best practices for your workload? Please provide more details so I can point you to the most relevant vLLM documentation or issue discussions.

Both versions are using the V1 engine.

Thank you for clarifying that both v0.9.0.1 and v0.10.0 are using the V1 engine. According to the available data, there are reports of performance regressions in vLLM V1 between versions, especially under high concurrency or load, with increased latency observed in newer releases (vLLM Issue #17540, vLLM Issue #19499). The async scheduler (--async-scheduling) is known to mitigate these regressions, as you observed, but the root cause of the default scheduler’s regression in v0.10.0 is not explicitly documented or explained in the official sources.

If you need a step-by-step technical breakdown of possible causes or tuning recommendations for V1, let me know.

Can you share any more details of the workload? Ideally, seeing the workload will help us reproduce the issue on our end.

To help reproduce and analyze the performance regression, could you provide more details about your workload? Specifically, please share:

  • The prompt/input distribution (average input/output token lengths, prompt types, etc.)
  • The request pattern (steady RPS, bursty, etc.)
  • Any custom sampling parameters or generation configs used
  • The exact vLLM server launch command and relevant flags
  • Whether you use streaming or non-streaming completions

This information will help others benchmark under similar conditions and identify the root cause, as suggested in vLLM Issue #17540.

Unfortunately, I cannot share the exact data as it is internal and confidential. However, I can provide the following statistics about the workload which may help your analysis:

  • Input tokens: Approximately 2,000

  • Output tokens: Approximately 100

I hope this level of detail is helpful. Please let me know if there is any other information I can provide within these constraints.

Thank you for providing the workload statistics. A workload with ~2,000 input tokens and ~100 output tokens per request is sufficient for a synthetic reproduction: vLLM’s benchmarking tools (e.g., benchmark_serving.py) let you specify these parameters for controlled testing (see the benchmark_serving.py usage documentation). You can set --random-input-len 2000 and --random-output-len 100 to simulate your scenario.
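
For example, a reproduction run along the following lines should approximate the token profile. This is only a sketch: it assumes the benchmark script from the vLLM repository and a server already listening on localhost:8000, and the model ID, request rate, and prompt count are placeholders rather than your exact configuration.

    # Hedged sketch: drive benchmarks/benchmark_serving.py with the token
    # lengths described above. Assumes a vLLM server is already listening on
    # localhost:8000; model ID, request rate, and prompt count are placeholders.
    import subprocess

    cmd = [
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", "google/gemma-3-27b-it",  # placeholder model ID
        "--dataset-name", "random",
        "--random-input-len", "2000",        # ~2,000 input tokens per request
        "--random-output-len", "100",        # ~100 output tokens per request
        "--request-rate", "70",              # matches the 70 RPS data point above
        "--num-prompts", "4200",             # roughly 60 s of traffic at 70 RPS
    ]
    subprocess.run(cmd, check=True)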

If you can also specify the request rate (RPS) and whether requests are sent in bursts or at a steady rate, it will further improve reproducibility, as suggested in vLLM Issue #17540.
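
To illustrate the difference between a steady and a bursty arrival pattern, here is a small client-side sketch against the OpenAI-compatible endpoint. The endpoint, model ID, and prompt content are placeholders, not your confidential workload, and the fixed inter-arrival gap is just one way to approximate steady load.

    # Hedged sketch: issue requests at a steady rate (one every 1/RPS seconds)
    # against a running vLLM OpenAI-compatible server and report mean latency.
    # A bursty pattern would instead submit requests in batches with idle gaps.
    import asyncio
    import time

    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    PROMPT = "benchmark " * 2000   # crude stand-in for a ~2,000-token input
    RPS = 70                       # steady request rate
    TOTAL_REQUESTS = 700           # 10 s of traffic at 70 RPS

    async def one_request() -> float:
        start = time.perf_counter()
        await client.completions.create(
            model="google/gemma-3-27b-it",  # placeholder model ID
            prompt=PROMPT,
            max_tokens=100,                 # ~100 output tokens per request
            temperature=0.0,
        )
        return (time.perf_counter() - start) * 1000.0

    async def main() -> None:
        tasks = []
        for _ in range(TOTAL_REQUESTS):
            tasks.append(asyncio.create_task(one_request()))
            await asyncio.sleep(1.0 / RPS)  # fixed inter-arrival gap = steady load
        latencies = await asyncio.gather(*tasks)
        print(f"mean latency: {sum(latencies) / len(latencies):.0f} ms")

    asyncio.run(main())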
