KV Cache quantizing?

I’m sure this has been asked, but what is the proper way to use quant k,v cache on ampere (3090’s)?

--kv-cache-dtype fp8_e4m3 (or fp8) does not seem to save any VRAM and the seq length doesn’t increase.

Thanks!

On NVIDIA Ampere GPUs such as the 3090, vLLM’s FP8 KV cache quantization (--kv-cache-dtype fp8 or fp8_e4m3) is not supported due to hardware and Triton kernel limitations. Passing these options will not reduce VRAM usage or increase the sequence length, and may produce errors or silently fall back to an unquantized cache. FP8 KV cache is only supported on newer architectures (e.g., Hopper and Ada: H100, RTX 4090, RTX 6000 Ada) and on AMD MI300; Ampere GPUs like the 3090 currently have no supported quantized KV cache option in vLLM.
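If it helps, here’s a minimal sketch of the capability check implied above: FP8 KV cache needs at least Ada-class hardware (CUDA compute capability 8.9), while the 3090 is Ampere at 8.6. The helper name is just for illustration, not a vLLM API:

```python
# Sketch: decide whether FP8 KV cache can work on a given GPU,
# based on CUDA compute capability. Ampere (8.6) lacks FP8 support;
# Ada (8.9) and Hopper (9.0) have it. Hypothetical helper, not vLLM API.
def supports_fp8_kv_cache(compute_capability: tuple[int, int]) -> bool:
    """True if the (major, minor) compute capability is Ada (8.9) or newer."""
    return compute_capability >= (8, 9)

# On a live system you'd obtain the tuple via torch.cuda.get_device_capability(0).
print(supports_fp8_kv_cache((8, 6)))  # RTX 3090 (Ampere) -> False
print(supports_fp8_kv_cache((8, 9)))  # RTX 4090 (Ada)    -> True
print(supports_fp8_kv_cache((9, 0)))  # H100 (Hopper)     -> True
```

Running this before launching the server saves a fruitless restart cycle with flags that will be ignored on unsupported hardware.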

Would you like more detail on supported hardware or alternative memory-saving strategies?

No, thank you though!

You’re welcome! If you have any more questions about vLLM or quantization in the future, feel free to ask.
