Does vLLM support FP8 with RDNA4?
Yes. Recent vLLM versions support FP8 quantization and inference on AMD RDNA4 (gfx12xx) GPUs, including native hardware-accelerated FP8 matrix multiplication and an FP8 KV cache, provided the necessary kernel and config patches are present in your build. On RDNA4 this reduces memory use and improves throughput for FP8 models, much as it does on AMD Instinct MI300-series and CDNA4 accelerators and on NVIDIA Hopper/Ada GPUs. Upstream support is still evolving, so some manual configuration or patching may be required for the best performance and for full model compatibility. The linked issue and PR describe the current state and the recommended way to enable FP8 on RDNA4 hardware.
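As a rough illustration of the setup described above, the sketch below assembles a `vllm serve` command line that requests FP8 weight quantization and an FP8 KV cache through the `--quantization` and `--kv-cache-dtype` options. The helper name `build_vllm_fp8_command` and the model ID are hypothetical placeholders, and the sketch assumes your vLLM build already includes working RDNA4 FP8 kernels; it does not apply any patches.

```python
import shlex


def build_vllm_fp8_command(model, quantize_weights=True, fp8_kv_cache=True):
    """Return a `vllm serve` argument list requesting FP8 (hypothetical helper)."""
    cmd = ["vllm", "serve", model]
    if quantize_weights:
        # Ask vLLM to quantize the weights to FP8 when loading the model.
        cmd += ["--quantization", "fp8"]
    if fp8_kv_cache:
        # Store the KV cache in FP8 to reduce memory use.
        cmd += ["--kv-cache-dtype", "fp8"]
    return cmd


# Placeholder model ID (an assumption); substitute the model you want to serve.
MODEL = "your-org/your-model"

# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(build_vllm_fp8_command(MODEL)))
```

If the checkpoint is already stored in FP8, the `--quantization fp8` option is usually unnecessary because vLLM reads the quantization settings from the model config; in that case, call the helper with `quantize_weights=False`.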
For more information and technical details, see the feature request and the related PR listed under Sources, along with the official FP8 quantization and quantized KV cache documentation.
Would you like more detail on setup, kernel configs, or troubleshooting?
Sources:
- https://github.com/vllm-project/vllm/issues/28649
- https://github.com/vllm-project/vllm/pull/34709
- https://docs.vllm.ai/en/latest/features/quantization/fp8.html
- https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache.html