Flash Attention 3 FP8 Support

Is it normal that execution time gets larger with an FP8 KV cache?
I thought the FP8 FlashAttention kernel was supposed to be faster than the FP16 one.

from vllm import LLM

llm_custom = LLM(
    model=model_path,
    dtype="auto",
    quantization="fp8",  # Tried both on and off
    kv_cache_dtype="fp8",
)

Thank you!

Using FP8 for the KV cache can improve throughput by allowing more tokens to be stored in GPU memory, but it does not inherently reduce execution time or latency. The current FP8 KV cache implementation benefits throughput rather than latency because it does not yet fuse dequantization with the attention operation. Future releases may support quantized attention with hardware acceleration, which could provide additional performance benefits (Quantized KV Cache).
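To make the throughput benefit concrete, here is a back-of-the-envelope sketch of KV cache memory per token. The model shape below (a Llama-2-7B-like configuration) is an illustrative assumption, not something stated in this thread; the point is only that FP8 halves bytes per element, so roughly twice as many tokens fit in the same cache budget:

```python
# Sketch: why an FP8 KV cache helps throughput. Halving bytes per element
# roughly doubles how many tokens (and thus concurrent requests) fit on the GPU.
# The layer/head numbers below are illustrative assumptions.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Factor of 2 accounts for storing both the key and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Example: a Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128).
fp16 = kv_cache_bytes_per_token(32, 32, 128, 2)  # FP16 = 2 bytes/element
fp8 = kv_cache_bytes_per_token(32, 32, 128, 1)   # FP8  = 1 byte/element

print(fp16, fp8, fp16 / fp8)  # per-token cache size and the 2x ratio
```

More resident tokens means larger batches, which raises throughput even though each individual request is not faster.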

If you are seeing increased execution time, it is likely the overhead of the quantization and dequantization steps, especially if your hardware does not natively support FP8 operations. Additionally, the FlashAttention kernel does not support an FP8 KV cache, so enabling it forces the xFormers backend, which may be less optimized for your use case (GitHub Issue).
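The dequantization overhead can be sketched in plain Python. Without a fused kernel, keys and values are quantized on write and dequantized back to a higher precision before every attention call, two extra elementwise passes that are pure overhead when the hardware cannot consume FP8 directly. The snippet below approximates FP8 E4M3 rounding with a per-tensor scaled uniform grid; it is an illustration of the round trip, not vLLM's actual kernel:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def quantize_fp8_like(x):
    """Per-tensor scaled quantization to an FP8-E4M3-like range.
    Real E4M3 rounds to a 3-bit mantissa; a uniform grid stands in here."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    return np.round(x / scale), scale  # rounding is the lossy step

def dequantize(q, scale):
    # One extra elementwise pass per attention call when not fused.
    return q * scale

keys = np.linspace(-10.0, 10.0, 1001, dtype=np.float32)
q, scale = quantize_fp8_like(keys)
restored = dequantize(q, scale)
max_err = float(np.abs(restored - keys).max())
print(f"max round-trip error: {max_err:.5f} (bound: {scale / 2:.5f})")
```

The accuracy cost of the round trip is small (bounded by half a quantization step), but the time cost of the two extra passes is what shows up as the latency increase you observed.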

Would you like more detail on how to potentially optimize your setup?
