Improving Speculative Decoding for Beginning Tokens & Structured Output

When using speculative decoding, the first decode token, or the first few tokens, are often critical. The draft model's acceptance rate tends to be lowest on these early tokens, which makes generation less efficient.

My idea is to let users explicitly specify that the first n tokens are generated by the main (or "original") model, and only then initiate speculative decoding. This approach, especially when combined with JSON structured output, could ensure the critical initial structure is established accurately before speculative generation begins.
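For concreteness, here is a purely hypothetical sketch of how such an option might look in vLLM's offline API. The `num_initial_main_tokens` field does not exist in vLLM; it is invented here only to illustrate the proposal. The shape of the `speculative_config` dict follows recent vLLM versions, and the model names are placeholders:

```python
# Hypothetical sketch: vLLM does NOT support num_initial_main_tokens today.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",          # main (target) model
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",   # draft model
        "num_speculative_tokens": 5,
        # Proposed knob (does not exist): let the main model generate the
        # first n tokens alone before speculative decoding kicks in.
        "num_initial_main_tokens": 8,
    },
)

params = SamplingParams(max_tokens=256, temperature=0.0)
out = llm.generate(["Return the user profile as strict JSON:"], params)
print(out[0].outputs[0].text)
```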

Your idea of generating the first n tokens with the main model before enabling speculative decoding is not currently a built-in feature in vLLM. The data sources do not mention an API or parameter for deferring the start of speculative decoding until after a fixed number of tokens. In vLLM, speculative decoding begins immediately after the prefill phase, and all sequences in a batch must use the same proposal length (or zero); per-sequence or delayed speculative starts are not supported yet, though related dynamic and per-sequence features are under discussion for future releases (SpecDecodeWorker docs; Feature: Per-sequence speculative decoding).

If you want to ensure the initial structure (e.g., for JSON output) is generated by the main model, you would need to split the work manually: run the first n tokens with standard decoding, then start a new request with speculative decoding enabled for the remainder. This workflow is not automated in vLLM today. Would you like more detail on how to implement this workaround, or on the roadmap for per-sequence/delayed speculative decoding?
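As a minimal sketch of that workaround, assuming the offline `LLM` API and a recent vLLM release where the draft model is set via `speculative_config` (older releases used separate `speculative_model`/`num_speculative_tokens` arguments); the model names and token counts here are placeholders:

```python
from vllm import LLM, SamplingParams

PROMPT = "Return the user profile as strict JSON:"
N_INITIAL = 8  # tokens to generate with the main model alone

# Phase 1: main model only, no speculative decoding, for the first n tokens.
main_llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
head = main_llm.generate(
    [PROMPT], SamplingParams(max_tokens=N_INITIAL, temperature=0.0)
)[0].outputs[0].text

# Phase 2: a fresh request with speculative decoding enabled, continuing
# from the text produced in phase 1. (Keeping two engines alive in one
# process doubles weight memory; in practice you may prefer separate
# processes, or tearing down the first engine before starting the second.)
spec_llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 5,
    },
)
tail = spec_llm.generate(
    [PROMPT + head], SamplingParams(max_tokens=256, temperature=0.0)
)[0].outputs[0].text

print(head + tail)
```

Note the cost of this split: phase 2 re-prefills the prompt plus the phase 1 tokens as a brand-new request, which is exactly the overhead a built-in deferred-start option could avoid.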

Sources: