KV Cache quantizing?

I’m sure this has been asked, but what is the proper way to use quant k,v cache on ampere (3090’s)?

--kv-cache-dtype fp8_e4m3 (or fp8) does not seem to save any VRAM and the seq length doesn’t increase.

Thanks!

On NVIDIA Ampere GPUs such as the 3090, vLLM’s FP8 KV cache quantization (--kv-cache-dtype fp8 or fp8_e4m3) is not supported due to hardware and Triton kernel limitations. Passing these options will not reduce VRAM usage or increase the sequence length, and may produce errors or silently fall back to an unquantized cache. FP8 KV cache is only supported on newer architectures (e.g., Hopper and Ada: H100, RTX 4090, RTX 6000 Ada) and on AMD MI300; Ampere GPUs like the 3090 currently have no supported quantized KV cache option in vLLM.
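If it helps, here’s a minimal sketch of the capability check implied above: FP8 KV cache needs at least Ada-class hardware (CUDA compute capability 8.9), while the 3090 is Ampere at 8.6. The helper name is just for illustration, not a vLLM API:

```python
# Sketch: decide whether FP8 KV cache can work on a given GPU,
# based on CUDA compute capability. Ampere (8.6) lacks FP8 support;
# Ada (8.9) and Hopper (9.0) have it. Hypothetical helper, not vLLM API.
def supports_fp8_kv_cache(compute_capability: tuple[int, int]) -> bool:
    """True if the (major, minor) compute capability is Ada (8.9) or newer."""
    return compute_capability >= (8, 9)

# On a live system you'd obtain the tuple via torch.cuda.get_device_capability(0).
print(supports_fp8_kv_cache((8, 6)))  # RTX 3090 (Ampere) -> False
print(supports_fp8_kv_cache((8, 9)))  # RTX 4090 (Ada)    -> True
print(supports_fp8_kv_cache((9, 0)))  # H100 (Hopper)     -> True
```

Running this before launching the server saves a fruitless restart cycle with flags that will be ignored on unsupported hardware.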

Would you like more detail on supported hardware or alternative memory-saving strategies?

No, thank you though!

You’re welcome! If you have any more questions about vLLM or quantization in the future, feel free to ask.
