Improving Speculative Decoding for Beginning Tokens & Structured Output

When using speculative decoding, the first decode token, or the first few tokens, are often critical. The draft model's acceptance rate tends to be lowest on these early tokens, which makes generation less efficient.

My idea is to let users explicitly specify that the first n tokens are generated by the main (or "original") model, and only then initiate speculative decoding. This approach, especially when combined with JSON structured output, could ensure the critical initial structure is established accurately before speculative generation begins.
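For concreteness, here is a purely hypothetical sketch of how such an option might look in vLLM's offline API. The `num_initial_main_tokens` field does not exist in vLLM; it is invented here only to illustrate the proposal. The shape of the `speculative_config` dict follows recent vLLM versions, and the model names are placeholders:

```python
# Hypothetical sketch: vLLM does NOT support num_initial_main_tokens today.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",          # main (target) model
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",   # draft model
        "num_speculative_tokens": 5,
        # Proposed knob (does not exist): let the main model generate the
        # first n tokens alone before speculative decoding kicks in.
        "num_initial_main_tokens": 8,
    },
)

params = SamplingParams(max_tokens=256, temperature=0.0)
out = llm.generate(["Return the user profile as strict JSON:"], params)
print(out[0].outputs[0].text)
```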

Your idea of generating the first n tokens with the main model before enabling speculative decoding is not currently a built-in feature in vLLM. The data sources do not mention an API or parameter for deferring the start of speculative decoding until after a fixed number of tokens. In vLLM, speculative decoding begins immediately after the prefill phase, and all sequences in a batch must use the same proposal length (or zero); per-sequence or delayed speculative starts are not supported yet, though related dynamic and per-sequence features are under discussion for future releases (SpecDecodeWorker docs; Feature: Per-sequence speculative decoding).

If you want to ensure the initial structure (e.g., for JSON output) is generated by the main model, you would need to split the work manually: run the first n tokens with standard decoding, then start a new request with speculative decoding enabled for the remainder. This workflow is not automated in vLLM today. Would you like more detail on how to implement this workaround, or on the roadmap for per-sequence/delayed speculative decoding?
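As a minimal sketch of that workaround, assuming the offline `LLM` API and a recent vLLM release where the draft model is set via `speculative_config` (older releases used separate `speculative_model`/`num_speculative_tokens` arguments); the model names and token counts here are placeholders:

```python
from vllm import LLM, SamplingParams

PROMPT = "Return the user profile as strict JSON:"
N_INITIAL = 8  # tokens to generate with the main model alone

# Phase 1: main model only, no speculative decoding, for the first n tokens.
main_llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
head = main_llm.generate(
    [PROMPT], SamplingParams(max_tokens=N_INITIAL, temperature=0.0)
)[0].outputs[0].text

# Phase 2: a fresh request with speculative decoding enabled, continuing
# from the text produced in phase 1. (Keeping two engines alive in one
# process doubles weight memory; in practice you may prefer separate
# processes, or tearing down the first engine before starting the second.)
spec_llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 5,
    },
)
tail = spec_llm.generate(
    [PROMPT + head], SamplingParams(max_tokens=256, temperature=0.0)
)[0].outputs[0].text

print(head + tail)
```

Note the cost of this split: phase 2 re-prefills the prompt plus the phase 1 tokens as a brand-new request, which is exactly the overhead a built-in deferred-start option could avoid.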

Sources: