[Spec Decode] Why does the acceptance rate look close to the paper, but the throughput is still not high?

wayne · April 21, 2025, 6:28am

Original (baseline, no speculative decoding):
bs 1 tp=1: [05:55<00:00, 4.45s/it, est. speed input: 17.02 toks/s, output: 46.43 toks/s

With EAGLE:
k=4, AL (acceptance length): 2.06
bs 1 tp=1 k=4: [04:07<00:00, 3.09s/it, est. speed input: 24.49 toks/s, output: 66.86 toks/s

Speedup: 44% (66.86 vs. 46.43 output toks/s)
The original paper reports roughly a 3x speedup.
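
As a rough sanity check on the numbers above, here is a minimal Python sketch. It rests on two assumptions that are not stated in the post: that "AL" is the average number of tokens emitted per target-model step, and that a simplified cost model applies in which each step costs one target forward pass plus k draft passes, each costing a fraction c of a target pass. The function name `modeled_speedup` and the variable `implied_c` are made up for illustration.

```python
# Figures copied from the post above.
baseline_tok_s = 46.43     # output toks/s without speculative decoding
eagle_tok_s = 66.86        # output toks/s with EAGLE, k=4
acceptance_length = 2.06   # "AL"; ASSUMED to mean tokens emitted per target step
k = 4                      # number of draft tokens proposed per step

# Measured end-to-end speedup from the two throughput figures.
measured_speedup = eagle_tok_s / baseline_tok_s


def modeled_speedup(al: float, k: int, c: float) -> float:
    """ASSUMED simplified cost model, not vLLM's actual accounting.

    One step = 1 target pass + k draft passes, each costing c target passes,
    and the step emits `al` tokens on average.
    """
    return al / (1.0 + k * c)


# Best case under this model: drafting is free (c = 0), so speedup equals AL.
upper_bound = modeled_speedup(acceptance_length, k, 0.0)

# Relative draft cost that would explain the measured speedup under this model.
implied_c = (acceptance_length / measured_speedup - 1.0) / k

print(f"measured speedup:          {measured_speedup:.2f}x")
print(f"upper bound (free draft):  {upper_bound:.2f}x")
print(f"implied draft cost c:      {implied_c:.3f} of a target pass")
```

Under these assumptions, an AL of 2.06 caps the speedup at about 2.06x even if drafting cost nothing, so a 3x result would need a noticeably higher acceptance length, and the gap between 2.06x and the measured 1.44x would be attributed to draft-model and scheduling overhead.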

RunLLM · April 21, 2025, 6:29am

It seems likely that you are comparing the performance of a model with and without a specific optimization (possibly speculative decoding or a similar technique) and noting the speedup achieved. The original paper might have reported a 3x improvement, but your results show a 44% speedup. This discrepancy could be due to differences in hardware, model configurations, or other experimental conditions.

Would you like more detail on how to potentially achieve the 3x speedup reported in the original paper?
