Is FCFS Scheduling Holding Back vLLM's Performance in Production?

Hi vLLM community :waving_hand:

We’ve been exploring scheduling strategies to better support mixed workloads. We’ve observed a problem much like a supermarket checkout line: when short/interactive requests (buying a single item) get stuck behind long/batch requests (a full cart), the standard FCFS policy leads to high latency for everyone – while also leaving throughput on the table.

We’ve developed a proof-of-concept for a new adaptive scheduling policy that learns from the workload in real time. The goal is to dynamically prioritize requests to protect latency-sensitive interactions (like an “express lane”) while simultaneously increasing overall system throughput.
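To make the “express lane” idea concrete, here’s a minimal sketch of the kind of policy we mean. All names and the 128-token threshold are made up for illustration – this is not vLLM’s scheduler, just a toy showing short requests jumping ahead of long ones with arrival order as the tie-breaker:

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

_seq = count()  # global arrival counter for tie-breaking

@dataclass(order=True)
class Request:
    # Only sort_key participates in ordering; the rest is payload.
    sort_key: tuple = field(init=False)
    prompt: str = field(compare=False)
    est_tokens: int = field(compare=False)
    arrival: int = field(compare=False, default_factory=lambda: next(_seq))

    def __post_init__(self):
        # Express lane: anything under 128 estimated tokens goes first;
        # within a lane, oldest request wins.
        lane = 0 if self.est_tokens < 128 else 1
        self.sort_key = (lane, self.arrival)

class ExpressLaneScheduler:
    def __init__(self):
        self._heap = []

    def submit(self, req: Request):
        heapq.heappush(self._heap, req)

    def next_request(self):
        return heapq.heappop(self._heap) if self._heap else None
```

A batch job submitted first still yields to an interactive request that arrives after it, which is exactly the head-of-line-blocking fix we’re after.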

In our internal tests on mixed workloads, this approach yielded:

· Over 30% increase in throughput

· Significant reduction in tail latency for short prompts (compared to the default scheduler)

We’ve visualized our metrics, and the trends are promising – happy to share once we’re further along.

We believe this direction – toward workload-aware scheduling – could be valuable for the vLLM ecosystem. We’re still in the early stages of formalizing the method but wanted to start a discussion around the broader challenge.

We’d love to hear your thoughts and experiences:

1. What’s the worst latency spike you’ve seen due to a big batch job clogging the queue? We’re curious to hear some “war stories” from the field.

2. When it comes to scheduling, what does the community need most? (Just reply with the letter/emoji!)
A) The current FCFS is fine, keep it simple.
B) :brain: A smarter, adaptive scheduler like the one we’re proposing.
C) :hammer_and_wrench: A fully pluggable interface so I can write my own custom scheduler.
D) :thinking: Something else? (Let us know!)

3. Is anyone else actively researching LLM scheduling policies? We’d love to connect and compare notes.

We’re also considering opening up a simulator for testing scheduling strategies – curious if others would find that useful.
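As a taste of what such a simulator could show, here’s a deliberately tiny single-server model (purely illustrative, not the actual tool): service time is just the token count, and we compare mean queueing delay under FCFS vs. a shortest-job-first ordering:

```python
# Toy queueing model: one server drains the queue in the given order;
# each request's "service time" is its token count.

def mean_wait(service_times, policy="fcfs"):
    order = list(service_times)
    if policy == "sjf":
        order.sort()  # shortest job first
    clock, total_wait = 0, 0
    for t in order:
        total_wait += clock  # time spent waiting before service starts
        clock += t
    return total_wait / len(order)

# One long batch job arrives just before three short chats:
workload = [500, 10, 10, 10]
print(mean_wait(workload, "fcfs"))  # 382.5 -- the long job blocks everyone
print(mean_wait(workload, "sjf"))   # 15.0  -- short jobs slip through first
```

Even this toy model shows the order-of-magnitude gap in average wait that reordering can recover; a real simulator would add arrivals over time, batching, and preemption.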

Thanks for your thoughts!
— Bronislav Sidik

Your observations are well-aligned with current vLLM community discussions and development. The FCFS (First-Come-First-Served) policy is known to cause head-of-line blocking, where short/interactive requests are delayed by long/batch jobs, leading to high tail latency and underutilized throughput. This is a widely reported pain point, especially in high-concurrency or mixed-workload scenarios: community members have reported latency spikes of several minutes (e.g., TTFT >250s) when large requests block the queue.

There is strong interest in more adaptive, workload-aware scheduling strategies to address this issue, including “express lane” or SJF (Shortest Job First)-like approaches, as well as pluggable/custom schedulers for advanced use cases. Recent work in the community includes chunked prefill, priority scheduling, and proposals for pluggable or externally controlled schedulers, with several RFCs and PRs discussing similar goals and approaches (source1, source2, source3, source4, source5).

The vLLM roadmap is moving toward supporting multiple scheduling policies (FCFS, priority, and potentially more adaptive or pluggable options), and the community is actively discussing the need for a more flexible, modular scheduler interface. There is also interest in external orchestration and simulation tools for scheduler evaluation. If you are interested in contributing your adaptive scheduler or a simulator, the community would likely welcome it, as there is clear demand for both smarter built-in policies and extensibility (source, source). Would you like a more detailed breakdown of current scheduler implementations, open RFCs, or guidance on how to contribute your work?

Hi @RunLLM

Thank you so much for this incredibly helpful and detailed response! This is exactly the kind of feedback we were hoping for.

It’s very encouraging to see that our observations on head-of-line blocking and the need for workload-aware scheduling are so well-aligned with the community’s experience. The specific data point about TTFT >250s is a powerful confirmation of the problem’s severity.

We will definitely dive into the RFCs and discussions you linked (especially the ones on Priority Scheduling and a Replaceable Scheduler). It’s great to see so much active work in this area.

Our goal with the adaptive policy we’re developing is to build directly on these ideas and contribute a solution that is both powerful and easy to integrate via the pluggable interface. The idea of a simulator also seems to have strong community interest, which we’re excited about.
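For discussion purposes, here’s the shape of pluggable interface we’d love to target. To be clear, this is our hypothetical sketch, not vLLM’s actual scheduler API: the engine would ask the policy only for an ordering decision, so swapping FCFS for something smarter is a one-class change:

```python
from abc import ABC, abstractmethod

# Illustrative only -- NOT vLLM's real API. The engine calls pick_next()
# each scheduling step; a policy overrides just the ordering decision.

class SchedulingPolicy(ABC):
    @abstractmethod
    def pick_next(self, waiting: list) -> int:
        """Return the index of the waiting request to schedule next."""

class FCFSPolicy(SchedulingPolicy):
    def pick_next(self, waiting: list) -> int:
        return 0  # oldest request first

class ShortestFirstPolicy(SchedulingPolicy):
    # Assumes each request carries an est_tokens estimate.
    def pick_next(self, waiting: list) -> int:
        return min(range(len(waiting)), key=lambda i: waiting[i].est_tokens)
```

Keeping the policy surface this narrow is what would let an adaptive scheduler (or anyone’s custom one) drop in without touching the engine core.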

Thanks again for pointing us in the right direction. We’re looking forward to engaging more with these ongoing efforts and sharing our progress.

Best,
Bronislav

You’re very welcome! Your approach—building an adaptive, workload-aware scheduler that is pluggable and easy to integrate—directly addresses the most pressing needs identified by the vLLM community. There is clear momentum toward modular scheduler interfaces and simulation tools, so your contributions would be highly relevant and valued (Priority Scheduling RFC, Replaceable Scheduler RFC).

If you decide to open source your scheduler or simulator, consider engaging with the maintainers via GitHub issues or PRs, as this is the main channel for design discussions and feedback. The community is eager for practical solutions and benchmarks, so sharing your results and implementation details will likely spark productive collaboration. Would you like a step-by-step guide on contributing a new scheduler or simulator to vLLM?