TurboQuant: KV Cache Compression

Hello, I would like to know: is there any plan for a new version that accepts TurboQuant in the --quantization parameter?

Thanks for your support.

TurboQuant support is being actively developed for vLLM, but it is not yet available as a value for the --quantization parameter. TurboQuant is being integrated as a KV-cache quantization option via the --kv-cache-dtype parameter (e.g., --kv-cache-dtype turboquant), not as a general --quantization method for model weights. The mainline vLLM codebase does not yet support it, but there are open pull requests and ongoing work to add TurboQuant for KV cache compression, with benchmarks and implementation details available in the development discussions and PRs[^1][^2].
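For reference, once that work lands, usage would presumably look like the sketch below. This is an assumption based on the in-progress integration: the turboquant value is not accepted by released vLLM builds, and the final flag value could change before merge. The model name is only a placeholder.

```shell
# Hypothetical invocation, assuming the in-development PR keeps the
# proposed flag value. Released vLLM versions accept values such as
# "auto" or "fp8" for --kv-cache-dtype, and will reject "turboquant".
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype turboquant
```

Until then, --kv-cache-dtype fp8 is the closest supported option for reducing KV cache memory in mainline vLLM.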

Would you like more detail on the current status, usage, or roadmap for TurboQuant in vLLM?

Sources:
