I found that Longest Prefix Matching (LPM) is used in SGLang as part of its cache-aware scheduling.
I also tested a similar implementation in vLLM, and it slightly improved throughput.
Has anyone else tried implementing LPM in vLLM or experimented with similar cache-aware scheduling strategies?
Yes, vLLM has recently added prefix cache-aware scheduling, which takes already-cached tokens into account when batching and scheduling requests. This improves KV cache utilization and can yield up to a 50% throughput improvement in scenarios with high prefix cache hit rates, similar to the benefit LPM provides in SGLang. The scheduler tracks cached blocks per sequence and adjusts its scheduling budget to account for already-cached tokens, prioritizing the uncached ones; it does not perform explicit LPM as SGLang does, but the effect is similar: cache reuse and throughput are maximized when requests share prefixes. Benchmarks show significant improvements in both throughput and latency when prefix caching is enabled in vLLM, especially for workloads with heavy prefix sharing. For more details, see the discussion and benchmarks in Prefix Cache Aware Scheduling [1/n].
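To make the budget adjustment concrete, here is a minimal sketch of a scheduling step that charges only uncached prompt tokens against the token budget. The `Request` dataclass, `num_cached_tokens` field, and `schedule_step` function are hypothetical stand-ins for illustration, not vLLM's actual scheduler types:

```python
# Illustrative sketch only: `Request` and `num_cached_tokens` are hypothetical
# stand-ins, not vLLM's actual scheduler API.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int
    num_cached_tokens: int  # prompt tokens already resident in the prefix cache


def schedule_step(waiting: list[Request], token_budget: int) -> list[Request]:
    """Greedily admit requests, charging only uncached tokens to the budget."""
    scheduled: list[Request] = []
    for req in waiting:
        uncached = req.num_prompt_tokens - req.num_cached_tokens
        if uncached > token_budget:
            break  # budget exhausted for this scheduling step
        token_budget -= uncached
        scheduled.append(req)
    return scheduled
```

The upshot is that a request whose prefix is mostly cached consumes almost none of the step's budget, so more such requests can be batched together.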
Would you like a step-by-step breakdown of how vLLM’s prefix cache-aware scheduling works and how it compares to SGLang’s LPM?
@RunLLM I'd like to know how this compares to SGLang's LPM.
vLLM’s prefix cache-aware scheduling and SGLang’s Longest Prefix Matching (LPM) both aim to maximize cache reuse by batching requests with shared prefixes, but they go about it differently. SGLang explicitly uses LPM to group and schedule the requests with the longest shared prefix, which maximizes cache hits. vLLM instead tracks cached blocks per sequence and adjusts its scheduling budget to account for already-cached tokens; it does not perform explicit LPM grouping, but opportunistically exploits prefix cache hits during scheduling and block allocation. As a result, SGLang’s LPM may achieve slightly better cache hit rates in some scenarios, but vLLM’s recent improvements have closed much of the performance gap, as shown in the benchmarks and discussion in Prefix Cache Aware Scheduling [1/n] and BatchLLM for better shared prefix utilizing in offline scenarios.
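For intuition, here is a rough sketch of LPM-style ordering over a token-level prefix trie. It is a simplification for illustration (the `PrefixTrie` and `lpm_order` names are made up here), not SGLang's actual RadixCache implementation:

```python
# Rough sketch of LPM-style ordering; not SGLang's RadixCache code.
class PrefixTrie:
    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens: list[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node, matched = node[tok], matched + 1
        return matched


def lpm_order(waiting: list[list[int]], cache: PrefixTrie) -> list[list[int]]:
    # Longest-prefix-match first: requests whose prefixes are already cached
    # run before those cached blocks have a chance to be evicted.
    return sorted(waiting, key=cache.match_len, reverse=True)
```

Serving the best-matching requests first keeps their cached KV blocks hot, which is the core idea behind LPM; vLLM gets a similar effect implicitly through its cached-block accounting rather than an explicit sort.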
If you want the absolute best cache hit rate and throughput for workloads with many shared prefixes, SGLang’s explicit LPM may have a slight edge. However, vLLM’s cache-aware scheduling now provides similar benefits in most practical cases, especially when requests are sorted or grouped by prefix (BatchLLM for better shared prefix utilizing in offline scenarios).
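If you control the client, a cheap way to approximate that grouping is to order prompts so that ones sharing a long prefix arrive adjacently. The helper below is only an illustration of the idea, not part of either framework's API:

```python
# Hypothetical client-side helper: ordering prompts so that ones sharing a long
# prefix land next to each other tends to raise prefix-cache hit rates on
# servers with prefix caching enabled (vLLM or SGLang).
def group_by_prefix(prompts: list[str]) -> list[str]:
    # Lexicographic order places prompts with common prefixes adjacently,
    # a cheap approximation of explicit prefix grouping.
    return sorted(prompts)


prompts = [
    "System: You are a helpful assistant.\nUser: Summarize document B.",
    "Translate to French: hello",
    "System: You are a helpful assistant.\nUser: Summarize document A.",
]
for p in group_by_prefix(prompts):
    print(p.splitlines()[0])
```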
Would you like a detailed step-by-step comparison or code examples?