Hi vLLM community!
We’ve been exploring scheduling strategies to better support mixed workloads. We’ve observed a challenge similar to a supermarket checkout: when short/interactive requests (buying one item) get stuck behind long/batch requests (a full cart), the standard FCFS policy leads to high latency for everyone – while also leaving throughput on the table.
We’ve developed a proof-of-concept for a new adaptive scheduling policy that learns from the workload in real time. The goal is to dynamically prioritize requests to protect latency-sensitive interactions (like an “express lane”) while simultaneously increasing overall system throughput.
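To make the "express lane" intuition concrete, here is a toy single-server queue simulation (not vLLM code, just a conceptual sketch we put together for illustration). It compares plain FCFS against a crude shortest-job-first stand-in for short-request prioritization; the workload, the `simulate` helper, and the `sjf` policy name are all made up for this example.

```python
import heapq

def simulate(requests, policy="fcfs"):
    """Toy single-server queue. Illustrative only, not the vLLM scheduler.

    requests: list of (arrival_time, service_time) tuples.
    policy: "fcfs" or "sjf" (shortest-job-first, a crude stand-in for an
            'express lane' that lets short requests jump ahead).
    Returns the average waiting time (arrival until service start).
    """
    pending = sorted(requests)          # order by arrival time
    clock, waits, ready, i, n = 0.0, [], [], 0, len(requests)
    while len(waits) < n:
        # Admit every request that has arrived by the current time.
        while i < n and pending[i][0] <= clock:
            arr, svc = pending[i]
            key = arr if policy == "fcfs" else svc  # priority key
            heapq.heappush(ready, (key, arr, svc))
            i += 1
        if not ready:
            clock = pending[i][0]       # idle: jump to the next arrival
            continue
        _, arr, svc = heapq.heappop(ready)
        waits.append(clock - arr)       # time spent queued
        clock += svc                    # serve to completion (non-preemptive)
    return sum(waits) / n

# Two long batch jobs arrive just before a burst of short interactive ones.
workload = [(0.0, 10.0), (0.05, 10.0)] + [(0.1, 0.2)] * 5
print(simulate(workload, "fcfs"))  # shorts wait behind both long jobs
print(simulate(workload, "sjf"))   # shorts overtake the second long job
```

In this toy setup the short requests' worst-case wait roughly halves under the prioritizing policy, while the long jobs are barely delayed; the real policy we're prototyping is adaptive rather than a static shortest-job-first rule, but the queueing effect it exploits is the same.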
In our internal tests on mixed workloads, this approach yielded:
· Over 30% increase in throughput
· Significant reduction in tail latency for short prompts
(compared to the default scheduler)
We’ve visualized our metrics, and the trends are promising – happy to share once we’re further along.
We believe this direction – toward workload-aware scheduling – could be valuable for the vLLM ecosystem. We’re still in the early stages of formalizing the method but wanted to start a discussion around the broader challenge.
We’d love to hear your thoughts and experiences:
1. What’s the worst latency spike you’ve seen due to a big batch job clogging the queue? We’re curious to hear some “war stories” from the field.
2. When it comes to scheduling, what does the community need most? (Just reply with the letter/emoji!)
A) The current FCFS is fine, keep it simple.
B) A smarter, adaptive scheduler like the one we’re proposing.
C) A fully pluggable interface so I can write my own custom scheduler.
D) Something else? (Let us know!)
3. Is anyone else actively researching LLM scheduling policies? We’d love to connect and compare notes.
We’re also considering opening up a simulator for testing scheduling strategies – curious if others would find that useful.
Thanks for your thoughts!
— Bronislav Sidik