FlashMLA issue when running FP8 DeepSeek V3 model on H20

FP8 KV cache is not supported by the FlashMLA backend in the vLLM V1 engine, and this applies to DeepSeek V3 on H20 as well. The error "NotImplementedError: FlashMLA V1 with FP8 KV cache not yet supported" is raised because this combination has not been implemented, as confirmed in the FlashMLA backend source and issue reports.

To work around this, you can:

  • Disable FP8 KV cache: do not pass --kv-cache-dtype fp8; leave the flag unset to use the default, or set a supported dtype.
  • Switch backend: if you require FP8 KV cache, use a backend that supports it (e.g., FlashInfer for some models/hardware). FlashMLA does not support it yet.
  • Keep in mind that vLLM V1 currently has no FP8 KV cache support for MLA models (like DeepSeek V3); support is planned but not yet available.
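
As a rough illustration of the first workaround, here is a minimal Python sketch that rewrites a launch command so any FP8 KV cache setting falls back to the default. The helper name drop_fp8_kv_cache, the model path, and the --tensor-parallel-size value are placeholders chosen for this example and are not taken from your setup; only the --kv-cache-dtype flag and its "auto" default come from vLLM itself.

```python
import shlex


def drop_fp8_kv_cache(args):
    """Return a copy of args where any fp8 --kv-cache-dtype value becomes 'auto'.

    Hypothetical helper for illustration only; it is not part of vLLM.
    Handles both '--kv-cache-dtype fp8' and '--kv-cache-dtype=fp8' forms.
    """
    cleaned = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--kv-cache-dtype" and i + 1 < len(args):
            value = args[i + 1]
            # fp8, fp8_e4m3 and fp8_e5m2 all start with "fp8"
            cleaned += [arg, "auto" if value.startswith("fp8") else value]
            i += 2
        elif arg.startswith("--kv-cache-dtype="):
            value = arg.split("=", 1)[1]
            new_value = "auto" if value.startswith("fp8") else value
            cleaned.append("--kv-cache-dtype=" + new_value)
            i += 1
        else:
            cleaned.append(arg)
            i += 1
    return cleaned


# Placeholder command line; substitute your own model and parallelism settings.
original = shlex.split(
    "vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --kv-cache-dtype fp8"
)
print(shlex.join(drop_fp8_kv_cache(original)))
```

Simply deleting the flag from your command has the same effect; the sketch only makes the change explicit for launch scripts that assemble their arguments programmatically.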

Would you like more detail on how to change your configuration or backend?

Sources: