Regarding the paper “Optimizing Speculative Decoding for Serving Large Language Models Using Goodput”: I’m curious when it will be implemented in vLLM.
The SmartSpec method from the paper “Optimizing Speculative Decoding for Serving Large Language Models Using Goodput” is planned for integration into vLLM. According to a GitHub issue, the vLLM team has already implemented SmartSpec on a fork that uses batch expansion, and plans to integrate it into the main repository soon. The first step is to remove batch expansion; beyond that, community contributions are needed to improve speculative decoding performance.
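For context, the paper’s core idea is to choose the speculation length that maximizes goodput (expected output tokens per second under the current load) rather than using a fixed length. Below is a minimal sketch of that selection loop; the helper names, the geometric acceptance model, and the latency constants are illustrative assumptions, not SmartSpec or vLLM code.

```python
# Toy sketch of goodput-based speculation-length selection.
# All helpers and constants here are hypothetical stand-ins.

def estimate_accepted_tokens(k: int, acceptance_rate: float) -> float:
    """Expected tokens produced per request for proposal length k.

    With a per-token acceptance rate `a`, rejection sampling accepts a
    geometric prefix of the k draft tokens plus one bonus token:
    E[len] = (1 - a**(k + 1)) / (1 - a).
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(k + 1)  # every draft token accepted
    return (1 - a ** (k + 1)) / (1 - a)

def estimate_batch_latency(k: int, batch_size: int) -> float:
    """Toy latency model: k small draft steps, then the target model
    verifies k + 1 tokens per request in one forward pass."""
    draft_time = 0.002 * k
    verify_time = 0.010 + 0.0001 * batch_size * (k + 1)
    return draft_time + verify_time

def best_speculation_length(batch_size: int, acceptance_rate: float,
                            max_k: int = 8) -> int:
    """Pick k maximizing goodput = accepted tokens / batch latency."""
    def goodput(k: int) -> float:
        tokens = batch_size * estimate_accepted_tokens(k, acceptance_rate)
        return tokens / estimate_batch_latency(k, batch_size)
    return max(range(max_k + 1), key=goodput)

# Example: under this toy model, the chosen k shrinks as batches grow,
# which is the qualitative behavior the paper targets.
print(best_speculation_length(batch_size=32, acceptance_rate=0.7))   # longer speculation
print(best_speculation_length(batch_size=256, acceptance_rate=0.7))  # shorter speculation
```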
Would you like more detail on the steps involved in this integration?
Integrating SmartSpec into vLLM involves several steps. The vLLM team first implemented SmartSpec on a fork that relies on batch expansion, in which each draft token is scored as a separate batch entry. The first step toward merging it into the main repository is to remove batch expansion, which streamlines verification and improves performance. Beyond that, community contributions are needed to improve speculative decoding performance more broadly and to implement tree-style speculative decoding.
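To make the “remove batch expansion” step concrete, here is a toy illustration of what batch expansion does during scoring: each draft-token prefix becomes its own row for the target model, multiplying a batch of B requests into B × (k + 1) rows. This is a conceptual sketch, not vLLM’s scorer code; removing expansion means verifying all k + 1 positions of a request in a single multi-token forward pass instead.

```python
# Conceptual illustration of batch expansion (not vLLM code):
# a request with context c and draft tokens d1..dk is expanded into
# k + 1 scoring rows, one per proposed position.

from typing import List

def expand_batch(contexts: List[List[int]],
                 drafts: List[List[int]]) -> List[List[int]]:
    """Batch-expansion scoring: one row per draft prefix.

    Emits [c], [c, d1], [c, d1, d2], ..., [c, d1..dk] so each row's
    next-token logits score one proposed position.
    """
    rows = []
    for context, draft in zip(contexts, drafts):
        for i in range(len(draft) + 1):
            rows.append(context + draft[:i])
    return rows

# Example: 1 request with 3 draft tokens becomes 4 rows instead of 1,
# which is the overhead the integration's first step eliminates.
rows = expand_batch([[101, 102]], [[7, 8, 9]])
assert len(rows) == 4
```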
Would you like more detailed information on any specific aspect of this integration process?