vLLM is actively developing support for storing the KV cache in NVFP4 (NVIDIA FP4) format. Ongoing work adds CUDA kernels and backend integration for writing and reading NVFP4 KV cache, including correct handling of per-block scales and dequantization to higher-precision types (FP8/FP16/BF16) before the QK and PV computations. As of now, full end-to-end NVFP4 KV cache support is not merged or generally available: current PRs cover kernel-level support and partial backend wiring, with further integration and testing in progress. See the discussion and technical details in this PR.
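To make the dequantization step concrete, here is a minimal NumPy sketch of block-scaled FP4 decoding, the kind of work an NVFP4 KV-cache read path has to do before QK/PV. The E2M1 value table and block size of 16 follow the public NVFP4 description; the function name, layout, and scale handling are illustrative assumptions, not vLLM's actual kernel interface.

```python
import numpy as np

# The 16 representable E2M1 (FP4) values, indexed by the 4-bit code.
E2M1_TABLE = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)
BLOCK = 16  # NVFP4 groups 16 elements per block scale

def dequantize_nvfp4_block(codes: np.ndarray, block_scales: np.ndarray,
                           tensor_scale: float) -> np.ndarray:
    """Dequantize 4-bit codes (one code per element) to float32.

    codes:        shape (n,), n a multiple of BLOCK, values in [0, 15]
    block_scales: shape (n // BLOCK,), per-block scales (FP8 on device)
    tensor_scale: single per-tensor scale factor
    """
    vals = E2M1_TABLE[codes]                  # decode 4-bit codes to real values
    vals = vals.reshape(-1, BLOCK)            # group into blocks of 16
    vals = vals * block_scales[:, None]       # apply per-block scale
    return (vals * tensor_scale).reshape(-1)  # apply per-tensor scale

# Toy usage: 32 elements -> 2 blocks, dequantized before attention math.
codes = np.random.randint(0, 16, size=32).astype(np.uint8)
scales = np.array([0.25, 0.5], dtype=np.float32)
k_dequant = dequantize_nvfp4_block(codes, scales, tensor_scale=1.0)
print(k_dequant.shape, k_dequant.dtype)  # (32,) float32
```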
For FP8 KV cache, vLLM already supports quantized storage with dequantization before the attention computation, whereas NVFP4 is still experimental and not fully exposed in the mainline user API. If you need a stable quantized KV cache today, FP8 is the recommended option (a minimal usage sketch follows below); NVFP4 support is expected to mature soon. Would you like more detail on the current NVFP4 implementation or usage examples?
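For reference, a minimal sketch of enabling the FP8 KV cache from vLLM's offline Python API; the model name is just an example, and the exact option surface can vary between vLLM versions, so check `vllm serve --help` or the engine arguments for your release.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model, swap in your own
    kv_cache_dtype="fp8",                      # store the KV cache in FP8
)
out = llm.generate(
    ["Explain KV cache quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```

The equivalent for the OpenAI-compatible server is passing `--kv-cache-dtype fp8` to `vllm serve`.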