I found that Longest Prefix Matching (LPM) is used in SGLang as part of its cache-aware scheduling.
I also tested a similar implementation in vLLM, and it slightly improved throughput.
Has anyone else tried implementing LPM in vLLM or experimented with similar cache-aware scheduling strategies?
Yes, vLLM has recently added prefix cache-aware scheduling, which takes already-cached tokens into account when batching and scheduling requests. This improves KV cache utilization and can yield up to a 50% throughput improvement in scenarios with high prefix cache hit rates, similar to the benefit LPM provides in SGLang. The scheduler tracks cached blocks per sequence and adjusts its scheduling budget to account for already-cached tokens, prioritizing the uncached ones; it does not perform explicit LPM as SGLang does, but the effect is similar: cache reuse and throughput are maximized when requests share prefixes. Benchmarks show significant improvements in both throughput and latency when prefix caching is enabled in vLLM, especially for workloads with heavy prefix sharing. For more details, see the discussion and benchmarks in Prefix Cache Aware Scheduling [1/n].
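To make the budget adjustment concrete, here is a minimal sketch of a scheduling step that charges only uncached prompt tokens against the token budget. The `Request` dataclass, `num_cached_tokens` field, and `schedule_step` function are hypothetical stand-ins for illustration, not vLLM's actual scheduler types:

```python
# Illustrative sketch only: `Request` and `num_cached_tokens` are hypothetical
# stand-ins, not vLLM's actual scheduler API.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int
    num_cached_tokens: int  # prompt tokens already resident in the prefix cache


def schedule_step(waiting: list[Request], token_budget: int) -> list[Request]:
    """Greedily admit requests, charging only uncached tokens to the budget."""
    scheduled: list[Request] = []
    for req in waiting:
        uncached = req.num_prompt_tokens - req.num_cached_tokens
        if uncached > token_budget:
            break  # budget exhausted for this scheduling step
        token_budget -= uncached
        scheduled.append(req)
    return scheduled
```

The upshot is that a request whose prefix is mostly cached consumes almost none of the step's budget, so more such requests can be batched together.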
Would you like a step-by-step breakdown of how vLLM’s prefix cache-aware scheduling works and how it compares to SGLang’s LPM?
@RunLLM I'd like to know how this compares to SGLang's LPM.
vLLM’s prefix cache-aware scheduling and SGLang’s Longest Prefix Matching (LPM) both aim to maximize cache reuse by batching requests with shared prefixes, but they go about it differently. SGLang explicitly uses LPM to group and schedule the requests with the longest shared prefix, which maximizes cache hits. vLLM instead tracks cached blocks per sequence and adjusts its scheduling budget to account for already-cached tokens; it does not perform explicit LPM grouping, but opportunistically exploits prefix cache hits during scheduling and block allocation. As a result, SGLang’s LPM may achieve slightly better cache hit rates in some scenarios, but vLLM’s recent improvements have closed much of the performance gap, as shown in the benchmarks and discussion in Prefix Cache Aware Scheduling [1/n] and BatchLLM for better shared prefix utilizing in offline scenarios.
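For intuition, here is a rough sketch of LPM-style ordering over a token-level prefix trie. It is a simplification for illustration (the `PrefixTrie` and `lpm_order` names are made up here), not SGLang's actual RadixCache implementation:

```python
# Rough sketch of LPM-style ordering; not SGLang's RadixCache code.
class PrefixTrie:
    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens: list[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node, matched = node[tok], matched + 1
        return matched


def lpm_order(waiting: list[list[int]], cache: PrefixTrie) -> list[list[int]]:
    # Longest-prefix-match first: requests whose prefixes are already cached
    # run before those cached blocks have a chance to be evicted.
    return sorted(waiting, key=cache.match_len, reverse=True)
```

Serving the best-matching requests first keeps their cached KV blocks hot, which is the core idea behind LPM; vLLM gets a similar effect implicitly through its cached-block accounting rather than an explicit sort.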
If you want the absolute best cache hit rate and throughput for workloads with many shared prefixes, SGLang’s explicit LPM may have a slight edge. However, vLLM’s cache-aware scheduling now provides similar benefits in most practical cases, especially when requests are sorted or grouped by prefix (BatchLLM for better shared prefix utilizing in offline scenarios).
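If you control the client, a cheap way to approximate that grouping is to order prompts so that ones sharing a long prefix arrive adjacently. The helper below is only an illustration of the idea, not part of either framework's API:

```python
# Hypothetical client-side helper: ordering prompts so that ones sharing a long
# prefix land next to each other tends to raise prefix-cache hit rates on
# servers with prefix caching enabled (vLLM or SGLang).
def group_by_prefix(prompts: list[str]) -> list[str]:
    # Lexicographic order places prompts with common prefixes adjacently,
    # a cheap approximation of explicit prefix grouping.
    return sorted(prompts)


prompts = [
    "System: You are a helpful assistant.\nUser: Summarize document B.",
    "Translate to French: hello",
    "System: You are a helpful assistant.\nUser: Summarize document A.",
]
for p in group_by_prefix(prompts):
    print(p.splitlines()[0])
```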
Would you like a detailed step-by-step comparison or code examples?