Hi, I have a question I'd like to ask for clarification.
If k=1, MTP (Multi-Token Prediction) can only predict one additional token ahead. How does this achieve inference acceleration? My understanding is that speculative decoding speeds things up because the main model can verify, in a single parallel pass, the multiple tokens predicted by the draft model. But if the draft predicts only one additional token, wouldn't verifying that single token actually add extra time? Please feel free to correct me if I've misunderstood something. Thanks!
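For context, here is the standard speculative-decoding arithmetic (a sketch of the general technique, not code from the vLLM PR). Even with k=1, one target-model forward pass both verifies the draft token and samples the next token itself, so each pass yields 2 tokens when the draft is accepted and 1 when it is rejected. The function name and i.i.d.-acceptance assumption are mine:

```python
# Hypothetical sketch: expected tokens emitted per target-model forward
# pass in speculative decoding with k draft tokens, assuming each draft
# token is accepted independently with the same probability.
def expected_tokens_per_pass(acceptance_rate: float, k: int = 1) -> float:
    """Expected tokens per target forward pass.

    The target pass always contributes 1 token; draft token i (1-indexed)
    is kept only if tokens 1..i are all accepted, which happens with
    probability acceptance_rate**i under the i.i.d. assumption.
    """
    a = acceptance_rate
    # 1 (target's own token) + sum over consecutively accepted draft tokens
    return 1.0 + sum(a**i for i in range(1, k + 1))

print(expected_tokens_per_pass(0.81, k=1))  # 1.81 tokens per pass
```

So with the ~81% acceptance rate reported below, k=1 yields up to ~1.81 tokens per target pass; the observed 1.63x speedup at QPS=1 sits below that ideal bound once draft-model and verification overhead are paid.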
The table and text below are from [Model][Speculative Decoding] DeepSeek MTP spec decode by luccafong · Pull Request #12755 · vllm-project/vllm · GitHub:
Benchmark Results
The acceptance rate is 81% ~ 82.3% on R1 with k=1.
The speedup depends on the QPS: 1.63x at QPS=1, with some improvement for QPS<8, as shown in the table below.
Results on various QPS
Draft TP=1
| QPS | Baseline TPOT (ms) | k=1 TPOT (ms) | Speedup |
|---|---|---|---|
| 1 | 55.47 | 33.99 | 1.63x |
| 2 | 57.58 | 48.80 | 1.18x |
| 4 | 64.29 | 51.02 | 1.26x |
| 6 | 122.93 | 108.15 | 1.14x |
| 8 | 120.18 | 119.14 | 1.00x |