FlashMLA issue when running FP8 DeepSeek V3 model on H20

RunLLM · September 9, 2025, 5:42am

The vLLM V1 engine does not support an FP8 KV cache with the FlashMLA backend, and this includes DeepSeek V3 on H20. vLLM raises "NotImplementedError: FlashMLA V1 with FP8 KV cache not yet supported" because this combination has not been implemented, as confirmed in the FlashMLA backend source and issue reports.

To work around this, you can:

Disable the FP8 KV cache: do not set --kv-cache-dtype fp8, and use the default or another supported dtype instead.
Switch backend: if you require an FP8 KV cache, use a backend that supports it (e.g., FlashInfer for some models/hardware). FlashMLA does not support it yet.

Note that vLLM V1 currently has no FP8 KV cache support for MLA models such as DeepSeek V3; support is planned but not yet available.
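
As a rough illustration of the first workaround, the sketch below strips an FP8 KV cache setting from a launch command so that vLLM falls back to its default KV cache dtype. The helper name drop_fp8_kv_cache, the model ID, and the --tensor-parallel-size value are assumptions made for this example and are not taken from the original report; only the --kv-cache-dtype flag comes from the discussion above.

```python
from typing import List


def drop_fp8_kv_cache(args: List[str]) -> List[str]:
    """Return a copy of args without any FP8 --kv-cache-dtype setting.

    Hypothetical helper for illustration only; it is not part of vLLM.
    Handles both "--kv-cache-dtype fp8" and "--kv-cache-dtype=fp8" forms,
    including variants whose value starts with "fp8".
    """
    cleaned: List[str] = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--kv-cache-dtype" and i + 1 < len(args):
            if args[i + 1].startswith("fp8"):
                i += 2  # skip the flag and its FP8 value
                continue
        elif arg.startswith("--kv-cache-dtype="):
            if arg.split("=", 1)[1].startswith("fp8"):
                i += 1  # skip the combined flag=value token
                continue
        cleaned.append(arg)
        i += 1
    return cleaned


# Example launch command (model ID and parallelism are assumed values).
original_cmd = [
    "vllm", "serve", "deepseek-ai/DeepSeek-V3",
    "--kv-cache-dtype", "fp8",
    "--tensor-parallel-size", "8",
]
fixed_cmd = drop_fp8_kv_cache(original_cmd)
print(" ".join(fixed_cmd))
```

A non-FP8 value such as --kv-cache-dtype auto is left untouched, so the helper only removes the setting that triggers the error.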

Would you like more detail on how to change your configuration or backend?
