Hi, I have a question I'd like to ask for clarification.
If k=1, MTP (Multi-Token Prediction) can only predict one additional token ahead. How does this achieve inference acceleration? My understanding is that speculative decoding speeds things up because the main model can verify, in a single parallel pass, the multiple tokens predicted by the draft model. But if the draft predicts only one additional token, wouldn't verifying that single token actually add extra time? Please feel free to correct me if I've misunderstood something. Thanks!
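For context, here is the standard speculative-decoding arithmetic (a sketch of the general technique, not code from the vLLM PR). Even with k=1, one target-model forward pass both verifies the draft token and samples the next token itself, so each pass yields 2 tokens when the draft is accepted and 1 when it is rejected. The function name and i.i.d.-acceptance assumption are mine:

```python
# Hypothetical sketch: expected tokens emitted per target-model forward
# pass in speculative decoding with k draft tokens, assuming each draft
# token is accepted independently with the same probability.
def expected_tokens_per_pass(acceptance_rate: float, k: int = 1) -> float:
    """Expected tokens per target forward pass.

    The target pass always contributes 1 token; draft token i (1-indexed)
    is kept only if tokens 1..i are all accepted, which happens with
    probability acceptance_rate**i under the i.i.d. assumption.
    """
    a = acceptance_rate
    # 1 (target's own token) + sum over consecutively accepted draft tokens
    return 1.0 + sum(a**i for i in range(1, k + 1))

print(expected_tokens_per_pass(0.81, k=1))  # 1.81 tokens per pass
```

So with the ~81% acceptance rate reported below, k=1 yields up to ~1.81 tokens per target pass; the observed 1.63x speedup at QPS=1 sits below that ideal bound once draft-model and verification overhead are paid.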
The table and text below are from [Model][Speculative Decoding] DeepSeek MTP spec decode by luccafong · Pull Request #12755 · vllm-project/vllm · GitHub:
Benchmark Results
The acceptance rate is 81% ~ 82.3% on R1 with k=1.
The speedup depends on the QPS: 1.63x at QPS=1, with some improvement for QPS<8, as shown in the table below.
Results on various QPS
Draft TP=1
| QPS | Baseline TPOT (ms) | k=1 TPOT (ms) | Speedup |
|---|---|---|---|
| 1 | 55.47 | 33.99 | 1.63x |
| 2 | 57.58 | 48.80 | 1.18x |
| 4 | 64.29 | 51.02 | 1.26x |
| 6 | 122.93 | 108.15 | 1.14x |
| 8 | 120.18 | 119.14 | 1.00x |