Standalone draft model spec decode support in v0.x and v1

Standalone draft model support was removed in v0.x releases (x > 10): Speculative Decoding - vLLM

And v1 does not support this either.

Can I know why this is not supported anymore?

Speculative decoding with a standalone draft model is not supported in vLLM v0.10.0 and later, nor in v1, because the feature was deprecated and removed as part of a major architectural rework. The vLLM team prioritized other speculative decoding methods (like ngram, medusa, eagle, and mtp) due to technical challenges in supporting multi-model KV cache management and architectural differences between draft and target models. There is no current support or timeline for reintroducing standalone draft model speculative decoding in v1; users are advised to use alternative methods for now. See the official documentation and related issues for details: v0.14.0 spec decode docs, feature removal discussion, and v1 user guide.

Would you like more detail on the technical reasons or migration alternatives?


@RunLLM is wrong.

Support for standalone draft models was reintroduced to V1 this week in this PR: https://github.com/vllm-project/vllm/pull/24322 (I'm the PR author).

You can use it from main now, and it should be part of a release soon (I’m not familiar with the release cycle).


You are correct—support for speculative decoding with a standalone draft model was reintroduced to vLLM V1 via PR #24322 and is now available on the main branch. This PR enables using a separate draft model for speculative decoding, with configuration and runtime support, and includes benchmarks and tests. It is not yet part of an official release, but can be used by installing vLLM from the latest main branch.
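For anyone wanting to try it from main, configuration goes through the `speculative_config` argument. A minimal sketch follows; the `"draft_model"` method value and the model names are assumptions based on the PR discussion, not verified against the merged API, so check the v1 spec decode docs for the exact keys:

```python
# Sketch of enabling a standalone draft model in vLLM V1 (installed from main).
# The key names mirror vLLM's speculative_config dict; the method string and
# model names below are placeholders/assumptions, not confirmed from the PR.
speculative_config = {
    "method": "draft_model",                      # assumed method name for standalone draft models
    "model": "meta-llama/Llama-3.2-1B-Instruct",  # small standalone draft model (placeholder)
    "num_speculative_tokens": 5,                  # draft tokens proposed per decoding step
}

# The target model is loaded as usual, with the draft model attached via the config:
# from vllm import LLM
# llm = LLM(
#     model="meta-llama/Llama-3.1-8B-Instruct",   # target model (placeholder)
#     speculative_config=speculative_config,
# )
```

The draft model must share the target's tokenizer/vocabulary for the proposed tokens to be verifiable, which is part of why multi-model support was hard to maintain in the first place.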

Would you like more detail on how to use this feature or its current limitations?
