vLLM is actively developing support for storing the KV cache in NVFP4 (NVIDIA FP4) format. Ongoing work adds CUDA kernels and backend integration for writing and reading NVFP4 KV cache, including correct handling of per-block scales and dequantization to higher-precision types (FP8/FP16/BF16) before the QK and PV computations. As of now, full end-to-end NVFP4 KV cache support is not merged or generally available: current PRs cover kernel-level support and partial backend wiring, with further integration and testing in progress. See the discussion and technical details in this PR.
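To make the dequantization step concrete, here is a minimal NumPy sketch of block-scaled FP4 decoding, the kind of work an NVFP4 KV-cache read path has to do before QK/PV. The E2M1 value table and block size of 16 follow the public NVFP4 description; the function name, layout, and scale handling are illustrative assumptions, not vLLM's actual kernel interface.

```python
import numpy as np

# The 16 representable E2M1 (FP4) values, indexed by the 4-bit code.
E2M1_TABLE = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)
BLOCK = 16  # NVFP4 groups 16 elements per block scale

def dequantize_nvfp4_block(codes: np.ndarray, block_scales: np.ndarray,
                           tensor_scale: float) -> np.ndarray:
    """Dequantize 4-bit codes (one code per element) to float32.

    codes:        shape (n,), n a multiple of BLOCK, values in [0, 15]
    block_scales: shape (n // BLOCK,), per-block scales (FP8 on device)
    tensor_scale: single per-tensor scale factor
    """
    vals = E2M1_TABLE[codes]                  # decode 4-bit codes to real values
    vals = vals.reshape(-1, BLOCK)            # group into blocks of 16
    vals = vals * block_scales[:, None]       # apply per-block scale
    return (vals * tensor_scale).reshape(-1)  # apply per-tensor scale

# Toy usage: 32 elements -> 2 blocks, dequantized before attention math.
codes = np.random.randint(0, 16, size=32).astype(np.uint8)
scales = np.array([0.25, 0.5], dtype=np.float32)
k_dequant = dequantize_nvfp4_block(codes, scales, tensor_scale=1.0)
print(k_dequant.shape, k_dequant.dtype)  # (32,) float32
```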
For FP8 KV cache, vLLM already supports quantized storage with dequantization before the attention computation, whereas NVFP4 is still experimental and not fully exposed in the mainline user API. If you need a stable quantized KV cache today, FP8 is the recommended option (a minimal usage sketch follows below); NVFP4 support is expected to mature soon. Would you like more detail on the current NVFP4 implementation or usage examples?
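For reference, a minimal sketch of enabling the FP8 KV cache from vLLM's offline Python API; the model name is just an example, and the exact option surface can vary between vLLM versions, so check `vllm serve --help` or the engine arguments for your release.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model, swap in your own
    kv_cache_dtype="fp8",                      # store the KV cache in FP8
)
out = llm.generate(
    ["Explain KV cache quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```

The equivalent for the OpenAI-compatible server is passing `--kv-cache-dtype fp8` to `vllm serve`.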