Can speculative decoding and prefix caching take effect simultaneously?
Yes, speculative decoding and prefix caching can be enabled together in vLLM, but compatibility depends on the vLLM version and attention backend. Recent releases support simultaneous use with the flash-attn backend, though it requires careful configuration; earlier versions or other backends may not support the combination, or may have bugs under high concurrency or with certain block managers. See PR #6144 and PR #6819 for details and limitations.
For best results, use the latest vLLM release, enable both features with the appropriate flags, and test under your own workload. Some models and block managers still have edge cases or require workarounds; per issue #5167, making all features mutually compatible is an ongoing effort.
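As a rough sketch of "enable both features with the appropriate flags": on a 0.9.x server this could look like the command below. The model name is only a placeholder, and flag spellings have changed between releases, so verify against `vllm serve --help` for your version.

```shell
# Sketch: launch a vLLM server with prefix caching and (n-gram) speculative
# decoding enabled together. Placeholder model; adjust flags to your version.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 4}'
```

Older releases instead used separate `--speculative-model` / `--num-speculative-tokens` flags rather than a single JSON `--speculative-config`.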
Would you like more detail on configuration or known issues?
INFO 08-05 11:33:17 [metrics.py:417] Avg prompt throughput: 1239.8 tokens/s, Avg generation throughput: 25.9 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 35.1%, CPU KV cache usage: 0.0%.
INFO 08-05 11:34:35 [metrics.py:439] Speculative metrics: Draft acceptance rate: 0.675, System efficiency: 0.531, Number of speculative tokens: 4, Number of accepted tokens: 8485, Number of draft tokens: 12564, Number of emitted tokens: 8347.
I don't see any statistics for the prefix caching hit ratio. The current vLLM version is 0.9.1. Does it not support enabling speculative decoding and prefix caching simultaneously?
Prefix caching and speculative decoding can be enabled together in vLLM, but in some versions (including 0.9.1), prefix cache hit statistics may not be logged when speculative decoding is active. This is a known limitation: the absence of a prefix cache hit ratio in your logs does not necessarily mean prefix caching is inactive, only that its metrics are not reported when both features are used together. See issue #5167 and PR #6144.
If you need prefix cache hit stats with speculative decoding, you may need to check for updates or open an issue for your use case. Would you like more detail on workarounds or internal logging?
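Incidentally, the speculative metrics that *are* logged above are internally consistent. Assuming the usual definitions (acceptance rate is accepted over drafted tokens; each speculative step drafts k tokens and can emit at most k + 1, so system efficiency is emitted tokens over that per-step maximum), the logged values can be reproduced as:

```python
# Reproduce the speculative metrics from the log, assuming the usual
# definitions of acceptance rate and system efficiency.
accepted = 8485   # Number of accepted tokens
drafted = 12564   # Number of draft tokens
emitted = 8347    # Number of emitted tokens
k = 4             # Number of speculative tokens per step

# Fraction of drafted tokens the target model accepted.
acceptance_rate = accepted / drafted

# Each step drafts k tokens and can emit at most k + 1 (the accepted
# drafts plus one token from the target model itself).
steps = drafted / k
system_efficiency = emitted / (steps * (k + 1))

print(round(acceptance_rate, 3))    # 0.675, matching the log
print(round(system_efficiency, 3))  # 0.531, matching the log
```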