Can speculative decoding and prefix caching take effect simultaneously?
Yes, speculative decoding and prefix caching can be enabled together in vLLM, but compatibility depends on the vLLM version and attention backend. Recent releases support simultaneous use with the flash-attn backend, though it requires careful configuration; earlier versions or other backends may not support the combination, or may have bugs under high concurrency or with certain block managers. See PR #6144 and PR #6819 for details and limitations.
For best results, use the latest vLLM release, enable both features with the appropriate flags, and test under your own workload. Some models and block managers still have edge cases or require workarounds; per issue #5167, making all features mutually compatible is an ongoing effort.
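As a rough sketch of "enable both features with the appropriate flags": on a 0.9.x server this could look like the command below. The model name is only a placeholder, and flag spellings have changed between releases, so verify against `vllm serve --help` for your version.

```shell
# Sketch: launch a vLLM server with prefix caching and (n-gram) speculative
# decoding enabled together. Placeholder model; adjust flags to your version.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 4}'
```

Older releases instead used separate `--speculative-model` / `--num-speculative-tokens` flags rather than a single JSON `--speculative-config`.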
Would you like more detail on configuration or known issues?
INFO 08-05 11:33:17 [metrics.py:417] Avg prompt throughput: 1239.8 tokens/s, Avg generation throughput: 25.9 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 35.1%, CPU KV cache usage: 0.0%.
INFO 08-05 11:34:35 [metrics.py:439] Speculative metrics: Draft acceptance rate: 0.675, System efficiency: 0.531, Number of speculative tokens: 4, Number of accepted tokens: 8485, Number of draft tokens: 12564, Number of emitted tokens: 8347.
I don't see any statistics for the prefix caching hit ratio. The current vLLM version is 0.9.1. Does it not support enabling speculative decoding and prefix caching simultaneously?
Prefix caching and speculative decoding can be enabled together in vLLM, but in some versions (including 0.9.1), prefix cache hit statistics may not be logged when speculative decoding is active. This is a known limitation: the absence of a prefix cache hit ratio in your logs does not necessarily mean prefix caching is inactive, only that its metrics are not reported when both features are used together. See issue #5167 and PR #6144.
If you need prefix cache hit stats with speculative decoding, you may need to check for updates or open an issue for your use case. Would you like more detail on workarounds or internal logging?
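Incidentally, the speculative metrics that *are* logged above are internally consistent. Assuming the usual definitions (acceptance rate is accepted over drafted tokens; each speculative step drafts k tokens and can emit at most k + 1, so system efficiency is emitted tokens over that per-step maximum), the logged values can be reproduced as:

```python
# Reproduce the speculative metrics from the log, assuming the usual
# definitions of acceptance rate and system efficiency.
accepted = 8485   # Number of accepted tokens
drafted = 12564   # Number of draft tokens
emitted = 8347    # Number of emitted tokens
k = 4             # Number of speculative tokens per step

# Fraction of drafted tokens the target model accepted.
acceptance_rate = accepted / drafted

# Each step drafts k tokens and can emit at most k + 1 (the accepted
# drafts plus one token from the target model itself).
steps = drafted / k
system_efficiency = emitted / (steps * (k + 1))

print(round(acceptance_rate, 3))    # 0.675, matching the log
print(round(system_efficiency, 3))  # 0.531, matching the log
```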