Flash Attention 3 FP8 Support

Is it normal that execution time gets larger with an FP8 KV cache?
I thought the FP8 FlashAttention kernel was supposed to be faster than the FP16 one.

from vllm import LLM

llm_custom = LLM(
    model=model_path,
    dtype="auto",
    quantization="fp8",  # Tried both on and off
    kv_cache_dtype="fp8",
)

Thank you!

Using FP8 for the KV cache can improve throughput by allowing more tokens to be stored in GPU memory, but it does not inherently reduce execution time or latency. The current FP8 KV cache implementation benefits throughput rather than latency because it does not yet fuse dequantization with the attention operation. Future releases may support quantized attention with hardware acceleration, which could provide additional performance benefits (Quantized KV Cache).
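To make the throughput benefit concrete, here is a back-of-the-envelope sketch of KV cache memory per token. The model shape below (a Llama-2-7B-like configuration) is an illustrative assumption, not something stated in this thread; the point is only that FP8 halves bytes per element, so roughly twice as many tokens fit in the same cache budget:

```python
# Sketch: why an FP8 KV cache helps throughput. Halving bytes per element
# roughly doubles how many tokens (and thus concurrent requests) fit on the GPU.
# The layer/head numbers below are illustrative assumptions.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Factor of 2 accounts for storing both the key and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Example: a Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128).
fp16 = kv_cache_bytes_per_token(32, 32, 128, 2)  # FP16 = 2 bytes/element
fp8 = kv_cache_bytes_per_token(32, 32, 128, 1)   # FP8  = 1 byte/element

print(fp16, fp8, fp16 / fp8)  # per-token cache size and the 2x ratio
```

More resident tokens means larger batches, which raises throughput even though each individual request is not faster.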

If you are seeing increased execution time, it is likely the overhead of the quantization and dequantization steps, especially if your hardware does not natively support FP8 operations. Additionally, the FlashAttention kernel does not support an FP8 KV cache, so enabling it forces the xFormers backend, which may be less optimized for your use case (GitHub Issue).
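The dequantization overhead can be sketched in plain Python. Without a fused kernel, keys and values are quantized on write and dequantized back to a higher precision before every attention call, two extra elementwise passes that are pure overhead when the hardware cannot consume FP8 directly. The snippet below approximates FP8 E4M3 rounding with a per-tensor scaled uniform grid; it is an illustration of the round trip, not vLLM's actual kernel:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def quantize_fp8_like(x):
    """Per-tensor scaled quantization to an FP8-E4M3-like range.
    Real E4M3 rounds to a 3-bit mantissa; a uniform grid stands in here."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    return np.round(x / scale), scale  # rounding is the lossy step

def dequantize(q, scale):
    # One extra elementwise pass per attention call when not fused.
    return q * scale

keys = np.linspace(-10.0, 10.0, 1001, dtype=np.float32)
q, scale = quantize_fp8_like(keys)
restored = dequantize(q, scale)
max_err = float(np.abs(restored - keys).max())
print(f"max round-trip error: {max_err:.5f} (bound: {scale / 2:.5f})")
```

The accuracy cost of the round trip is small (bounded by half a quantization step), but the time cost of the two extra passes is what shows up as the latency increase you observed.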

Would you like more detail on how to potentially optimize your setup?
