It sounds like you are comparing a model's performance with and without a specific optimization (possibly speculative decoding or a similar technique) and measuring the resulting speedup. The original paper may have reported a 3x improvement, while your results show a 44% speedup, which is about 1.44x if that figure is a throughput gain. The gap could come from differences in hardware, model configuration, or other experimental conditions.
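To put the two numbers on the same scale, here is a minimal sketch that converts a percentage into a speedup factor. It assumes the 44% figure is either a throughput gain or a latency reduction, since the message does not say which; the function names are hypothetical and the 3x value is taken as stated above, not verified against the paper.

```python
def factor_from_throughput_gain(percent: float) -> float:
    """Speedup factor if `percent` is a gain in throughput (e.g. tokens/s)."""
    return 1.0 + percent / 100.0


def factor_from_latency_reduction(percent: float) -> float:
    """Speedup factor if `percent` is a reduction in wall-clock time."""
    if not 0 <= percent < 100:
        raise ValueError("latency reduction must be in [0, 100)")
    return 1.0 / (1.0 - percent / 100.0)


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Speedup factor from two measured run times."""
    return baseline_seconds / optimized_seconds


REPORTED_FACTOR = 3.0  # assumption: the figure attributed to the paper

as_throughput = factor_from_throughput_gain(44)   # about 1.44x
as_latency = factor_from_latency_reduction(44)    # about 1.79x

print(f"44% as throughput gain:    {as_throughput:.2f}x")
print(f"44% as latency reduction:  {as_latency:.2f}x")
print(f"reported in the paper (?): {REPORTED_FACTOR:.2f}x")
```

Under either reading the measured result stays well below 3x, so the gap is not just a matter of how the percentage is expressed. If you have raw timings, `speedup(baseline_seconds, optimized_seconds)` gives the factor directly.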
Would you like more detail on how you might get closer to the 3x speedup attributed to the original paper?