Why can k=1 achieve inference acceleration for DeepSeek MTP spec decode?

Hi, I have a question I’d like clarified.
If k=1, MTP (Multi-Token Prediction) only predicts one additional token ahead. How exactly does this achieve inference acceleration? My understanding is that speculative decoding speeds things up because the main model can verify the multiple tokens predicted by the draft model in parallel. But if only one additional token is predicted, wouldn’t verifying that single token actually add extra time? Please correct me if I’ve misunderstood something. Thanks!

The table and text below are from [Model][Speculative Decoding] DeepSeek MTP spec decode by luccafong · Pull Request #12755 · vllm-project/vllm · GitHub

Benchmark Results

The acceptance rate is 81%–82.3% on R1 with k=1.
The speedup depends on the QPS, with a 1.63x speedup at QPS=1 and some improvement for QPS < 8, as shown in the table below.

Results on various QPS

Draft TP=1

| QPS | Baseline TPOT (ms) | k=1 TPOT (ms) | Speedup |
|-----|--------------------|---------------|---------|
| 1   | 55.47              | 33.99         | 1.63x   |
| 2   | 57.58              | 48.80         | 1.18x   |
| 4   | 64.29              | 51.02         | 1.26x   |
| 6   | 122.93             | 108.15        | 1.14x   |
| 8   | 120.18              | 119.14        | 1.00x   |

With k=1 in MTP, the draft (MTP) head predicts one token ahead, and the main model verifies it on the next forward pass. The key point is that the verification pass does double duty: it scores the drafted token and, at the same time, produces the following token itself. When the draft is accepted (81%–82.3% of the time on R1), a single target forward pass therefore emits two tokens instead of one. Because decoding at low QPS is memory-bandwidth-bound, running the target model over one extra position costs only slightly more than a plain decode step, and the MTP head itself is very cheap. The expected number of tokens per target step rises to roughly 1 + acceptance rate ≈ 1.8, which is why the speedup is largest at low QPS, e.g. 1.63x at QPS=1 in the benchmark table (vllm/pull/12755).
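
A back-of-the-envelope sketch (my own simplification, not from the PR) makes this concrete: if a verification step costs (1 + overhead) times a plain decode step, the expected speedup is (1 + acceptance rate) / (1 + overhead). The overhead value below is back-solved from the QPS=1 numbers, not a measured quantity.

```python
# Back-of-the-envelope model for k=1 MTP speculative decoding.
# Assumption (not from the PR): decode at low QPS is memory-bandwidth-bound,
# so verifying one extra position adds only a small fractional overhead.

accept_rate = 0.81  # acceptance rate reported in PR #12755 (81%-82.3% on R1)

# Each target forward pass scores the drafted token and also produces the
# next token itself, so it emits 2 tokens when the draft is accepted and
# 1 token when it is rejected.
tokens_per_target_step = 1 + accept_rate  # ~1.81 tokens per target step

def speedup(accept_rate: float, overhead: float) -> float:
    """Ideal speedup if a verification step costs (1 + overhead) decode steps."""
    return (1 + accept_rate) / (1 + overhead)

print(speedup(0.81, 0.00))  # ~1.81x upper bound with zero overhead
print(speedup(0.81, 0.11))  # ~1.63x, matching the QPS=1 measurement if the
                            # draft + extra verification cost ~11% per step
                            # (the 11% is back-solved, not measured)
```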

So the acceleration still comes from the same mechanism as with larger k, just with a shorter speculation window: more than one token can be emitted per expensive target forward pass, while the draft’s output is checked in that same pass without extra round-trips or idle time. As QPS increases, the batch grows and the system becomes compute-bound, so the extra verification positions and the draft overhead are no longer hidden and the benefit shrinks toward 1.0x at QPS=8 (vllm/pull/12755). Would you like a step-by-step breakdown of the workflow and optimizations?
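
For reference, here is a minimal launch sketch for running DeepSeek MTP spec decode with k=1 in vLLM. The `speculative_config` keys and the method string are assumptions based on recent vLLM versions and may differ from what PR #12755 shipped, so check the PR and the current speculative decoding docs before relying on them.

```python
# Hypothetical sketch: enabling DeepSeek MTP speculative decoding with k=1.
# Exact config keys and the method name vary across vLLM versions; verify
# against your installed version and the speculative decoding docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,          # placement depends on your hardware
    speculative_config={
        "method": "deepseek_mtp",    # assumed method name for the MTP head
        "num_speculative_tokens": 1,  # k=1: the MTP head drafts one token ahead
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```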
