TurboQuant: KV Cache Compression

Hello, I would like to know: is there any plan for a new version that accepts TurboQuant in the --quantization parameter?

Thanks for your support.

TurboQuant support is being actively developed for vLLM, but it is not yet available as a value for the --quantization parameter. TurboQuant is being integrated as a KV-cache quantization option via the --kv-cache-dtype parameter (e.g., --kv-cache-dtype turboquant), not as a general --quantization method for model weights. The mainline vLLM codebase does not yet support it, but there are open pull requests and ongoing work to add TurboQuant for KV cache compression, with benchmarks and implementation details available in the development discussions and PRs[^1][^2].
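For reference, once that work lands, usage would presumably look like the sketch below. This is an assumption based on the in-progress integration: the turboquant value is not accepted by released vLLM builds, and the final flag value could change before merge. The model name is only a placeholder.

```shell
# Hypothetical invocation, assuming the in-development PR keeps the
# proposed flag value. Released vLLM versions accept values such as
# "auto" or "fp8" for --kv-cache-dtype, and will reject "turboquant".
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype turboquant
```

Until then, --kv-cache-dtype fp8 is the closest supported option for reducing KV cache memory in mainline vLLM.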

Would you like more detail on the current status, usage, or roadmap for TurboQuant in vLLM?

Sources:
