Native FP8 support for RDNA4 (gfx1201) is not yet fully merged into vLLM mainline; per community discussions, it still requires local patches and kernel config files. The key PRs for running amd/gpt-oss-20b in FP8 on AMD GPUs (including the R9700) are #29008 (amd-quark format, model loading, and emulation) and #31962 (MoE computation order fix). Both are merged, but you may need vLLM 0.15+ and should confirm that your local repo includes these changes. On RDNA4 you must also add the kernel configs and disable AITER (VLLM_ROCM_USE_AITER=0), because RDNA4-specific WMMA support is not yet upstream in main.
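As a rough illustration of the two checks above (vLLM version and the AITER setting), here is a minimal Python sketch. The helper names `version_at_least` and `build_launch` are hypothetical, the 0.15 threshold is taken from the text rather than from release notes, and the launch command assumes the standard `vllm serve <model>` entry point with no extra flags. It does not apply the RDNA4 patches or kernel configs, which still have to be added by hand.

```python
import os
import re
import shlex
from importlib import metadata


def version_at_least(version: str, minimum: tuple = (0, 15)) -> bool:
    """Compare the leading major.minor of a version string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False  # unparseable version string: treat as too old
    return (int(match.group(1)), int(match.group(2))) >= minimum


def build_launch(model: str = "amd/gpt-oss-20b"):
    """Return (env, cmd) for a vLLM server launch with AITER disabled."""
    env = dict(os.environ)
    env["VLLM_ROCM_USE_AITER"] = "0"  # setting named in the answer above
    cmd = ["vllm", "serve", model]    # assumed plain invocation, no extra flags
    return env, cmd


# Report the installed vLLM version, if any, without importing vllm itself.
try:
    installed = metadata.version("vllm")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("vLLM is not installed in this environment")
elif not version_at_least(installed):
    print(f"vLLM {installed} is older than 0.15; the merged PRs may be missing")
else:
    print(f"vLLM {installed} meets the 0.15+ suggestion")

env, cmd = build_launch()
print("VLLM_ROCM_USE_AITER=" + env["VLLM_ROCM_USE_AITER"], shlex.join(cmd))
# To actually start the server: subprocess.run(cmd, env=env, check=True)
```

The sketch only prints the command instead of running it, so it is safe to try on a machine without a GPU. A passing version check does not prove that a source checkout contains #29008 and #31962; that still needs to be verified in the repo history.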
FP8 models that the community has run successfully on AMD include Qwen3, Mixtral, and GPT-OSS variants; Qwen3-30B and GPT-OSS-20B are cited as working well on the R9700 once the patches and configs above are applied. There is no single PR tracking full RDNA4 FP8 support; progress is documented in issues such as #28649 and related discussions. For step-by-step details, see the implementation notes and config requirements in those issues and PRs.
Would you like a detailed step-by-step guide for patching and running FP8 models on R9700?
Sources:
- #29008 ROCm Quantization GPT_OSS in amd-quark format
- #31962 Kernel MoE computation order fix
- #28649 RDNA4 FP8 Patch Issue