Questions about KV Cache Quantization

I’m currently trying to understand KV cache quantization in vLLM. From what I’ve learned, the data type of the KV cache is controlled by the kv_cache_dtype parameter. As I’m still learning about vLLM, I’m a bit confused about this design — why isn’t the data type of the KV cache aligned with that of the model weights?

For example, if I use model weights quantized to INT8, why can’t the KV cache also be set to INT8 (or at least FP8)?

I’ve read the following example documentation:
Quantized KV Cache — vLLM Documentation

From my current understanding, even with KV cache quantization, we still need to have a quantized model. So, if I’ve already applied model quantization using AWQ or GPTQ, can I directly set kv_cache_dtype="fp8" to enable KV cache quantization?

Of course, I still have much to learn about model quantization. I would really appreciate it if someone could help clarify this for me. Thank you very much!

You are correct that the KV cache data type in vLLM is controlled by the kv_cache_dtype parameter, and that it is independent of the model weights' data type. This separation lets you choose the data type that best fits the KV cache, which has different performance and memory trade-offs than the weights. For example, even if your model weights are quantized to INT8 (using AWQ or GPTQ), you can set kv_cache_dtype="fp8" to store the KV cache in FP8, which reduces memory usage and can improve throughput, since the cache is often a major memory bottleneck during inference. The separation is intentional: weight-only methods like AWQ and GPTQ do not quantize activations or the KV cache, and their weights are dequantized to higher precision (e.g., FP16) for computation, so the cache can be quantized separately, and more aggressively, without affecting the weight quantization scheme. See Quantized KV Cache — vLLM Documentation and GitHub Issue #1544.
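As a minimal sketch of this independence, the snippet below loads a weight-quantized checkpoint while storing the KV cache in FP8. The model name is a placeholder for whatever AWQ/GPTQ checkpoint you are using, and it assumes a GPU with FP8 KV-cache kernel support:

```python
# Sketch: weight quantization (AWQ) and KV cache dtype are configured independently.
# "TheBloke/Llama-2-7B-AWQ" is only a placeholder checkpoint name.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # weight-only AWQ quantization (placeholder)
    kv_cache_dtype="fp8",             # store the KV cache in FP8, independent of the weights
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain KV cache quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```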

If you have already quantized your model weights with AWQ or GPTQ, you can still set kv_cache_dtype="fp8" to enable FP8 KV cache quantization. This is supported and commonly used to reduce memory usage further, even though the weights are already quantized. Note, however, that AWQ and GPTQ do not quantize the KV cache themselves, and vLLM does not currently support an INT4 KV cache due to hardware and kernel limitations; FP8 (E4M3 or E5M2) is the main supported format for cache quantization. For best accuracy with an FP8 KV cache, use calibrated scaling factors as described in the documentation; uncalibrated FP8 also works, with a small accuracy trade-off. See Quantized KV Cache — vLLM Documentation, GitHub Issue #1544, and GitHub PR #1507.
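For the uncalibrated path, a rough sketch looks like the following. With kv_cache_dtype="fp8" alone, the scaling factors default to 1.0, which is the "small accuracy trade-off" case; the calculate_kv_scales flag (assumed to be available in your vLLM version) computes scales dynamically at runtime instead. For best accuracy, use a checkpoint exported with calibrated KV cache scales as the documentation describes. The GPTQ model name is again a placeholder:

```python
# Sketch of uncalibrated FP8 KV cache on top of a GPTQ-quantized checkpoint.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ checkpoint
    kv_cache_dtype="fp8",              # FP8 KV cache; scales default to 1.0
    calculate_kv_scales=True,          # assumption: supported in your vLLM version
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```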

Would you like a step-by-step explanation of how to enable and use FP8 KV cache quantization with a quantized model?
