Standalone draft model spec decode support in v0.x and v1

Standalone draft model support was removed in v0.x releases (x > 10): Speculative Decoding - vLLM

And v1 does not support this either.

Can I know why this is not supported anymore?

Speculative decoding with a standalone draft model is not supported in vLLM v0.10.0 and later, nor in v1, because the feature was deprecated and removed as part of a major architectural rework. The vLLM team prioritized other speculative decoding methods (like ngram, medusa, eagle, and mtp) due to technical challenges in supporting multi-model KV cache management and architectural differences between draft and target models. There is no current support or timeline for reintroducing standalone draft model speculative decoding in v1; users are advised to use alternative methods for now. See the official documentation and related issues for details: v0.14.0 spec decode docs, feature removal discussion, and v1 user guide.

Would you like more detail on the technical reasons or migration alternatives?


@RunLLM is wrong.

Support for standalone draft models was reintroduced to V1 this week in this PR: https://github.com/vllm-project/vllm/pull/24322 (I'm the PR author).

You can use it from main now, and it should be part of a release soon (I’m not familiar with the release cycle).


You are correct—support for speculative decoding with a standalone draft model was reintroduced to vLLM V1 via PR #24322 and is now available on the main branch. This PR enables using a separate draft model for speculative decoding, with configuration and runtime support, and includes benchmarks and tests. It is not yet part of an official release, but can be used by installing vLLM from the latest main branch.
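For anyone wanting to try it from main, configuration goes through the `speculative_config` argument. A minimal sketch follows; the `"draft_model"` method value and the model names are assumptions based on the PR discussion, not verified against the merged API, so check the v1 spec decode docs for the exact keys:

```python
# Sketch of enabling a standalone draft model in vLLM V1 (installed from main).
# The key names mirror vLLM's speculative_config dict; the method string and
# model names below are placeholders/assumptions, not confirmed from the PR.
speculative_config = {
    "method": "draft_model",                      # assumed method name for standalone draft models
    "model": "meta-llama/Llama-3.2-1B-Instruct",  # small standalone draft model (placeholder)
    "num_speculative_tokens": 5,                  # draft tokens proposed per decoding step
}

# The target model is loaded as usual, with the draft model attached via the config:
# from vllm import LLM
# llm = LLM(
#     model="meta-llama/Llama-3.1-8B-Instruct",   # target model (placeholder)
#     speculative_config=speculative_config,
# )
```

The draft model must share the target's tokenizer/vocabulary for the proposed tokens to be verifiable, which is part of why multi-model support was hard to maintain in the first place.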

Would you like more detail on how to use this feature or its current limitations?
