This is a known issue: after upgrading from vLLM 0.9.0.1 to 0.10.0, users have reported higher average latency under high load with the default scheduler. Enabling the --async-scheduling flag in v0.10.0 resolves the regression, bringing performance back to or above v0.9.0.1 levels. Others have reported the same behavior, so it is not unique to your setup. The async scheduler is currently experimental but is recommended for high-concurrency scenarios, where it reduces CPU overhead and improves latency and throughput. According to the vLLM CLI documentation, it may not yet support all features (e.g., structured outputs, speculative decoding, pipeline parallelism).
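As a rough illustration of the workaround, the sketch below assembles a `vllm serve` command line with the --async-scheduling flag appended. The helper name `build_serve_command`, the placeholder model ID `my-org/my-model`, and the port value are assumptions made for this example only; substitute your own model and arguments.

```python
import shlex
from typing import List


def build_serve_command(model: str, async_scheduling: bool = True, port: int = 8000) -> List[str]:
    """Assemble a `vllm serve` argument list (hypothetical helper for illustration)."""
    cmd = ["vllm", "serve", model, "--port", str(port)]
    if async_scheduling:
        # Experimental flag discussed above; omit it to fall back to the default scheduler.
        cmd.append("--async-scheduling")
    return cmd


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model ID.
    print(shlex.join(build_serve_command("my-org/my-model")))
```

Building the command with and without the flag makes it easy to launch both configurations under the same load and compare latency yourself. If your deployment relies on any of the features listed above as possibly unsupported, test them with the flag enabled before switching over.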
The slowdown with the default scheduler in v0.10.0 is not documented as an intended change, so it may be an unintended regression. For now, the async scheduler is the suggested workaround. Would you like a step-by-step breakdown of the relevant scheduler and async_scheduling options, or more detail on the underlying cause?