Hello everyone,
I’m currently working with vllm-rocm, a vLLM fork designed for AMD GPU devices.
I’m running DeepSeek V3 with the AITER MLA backend (AITER is an AMD kernel library that accelerates DeepSeek models on AMD GPUs through vLLM).
However, AITER only supports a block size of 1, and with this configuration, I’ve noticed a significant drop in APC performance.
My question is: Does the KV-cache block size impact APC performance?
In particular, should I expect degraded performance when using a small block size (e.g., 1)?
Thanks, and best regards
It seems likely that KV-cache block size does impact Automatic Prefix Caching (APC) performance. APC operates at block granularity: only full blocks are hashed and cached, so partial blocks are never reused, and with a very small block size (such as 1) repeated prompts may not benefit from cache reuse as efficiently. This is consistent with user reports that APC only becomes effective once the input length reaches or exceeds the block size, and with the fact that the AITER MLA backend currently supports only block_size=1 for DeepSeek models on ROCm, which may limit APC’s benefits relative to larger block sizes on other backends. That said, the available sources do not give a quantitative analysis of APC performance at block_size=1, so the exact impact in your scenario remains uncertain. See the discussion in vLLM issue #14686 and the AITER MLA backend PR.
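To make the block-granularity point concrete, here is a minimal, self-contained sketch of prefix caching at block granularity. This is not vLLM’s actual implementation; `cacheable_block_hashes` is a hypothetical helper, and the chained-hash scheme is only an approximation of how a prefix cache keys full blocks. It illustrates two effects of the block size: a trailing partial block is never cacheable, and with block_size=1 every token becomes its own block, multiplying the number of hashes to compute, store, and look up.

```python
from hashlib import sha256

def cacheable_block_hashes(token_ids, block_size):
    """Hash each *full* block of a token sequence, chaining in the previous
    block's hash so a block is only reusable when its entire prefix matches.
    The trailing partial block (if any) is skipped: it is not cacheable."""
    hashes = []
    prev = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        h = sha256(prev + repr(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

prompt = list(range(100))  # stand-in for a 100-token prompt

# block_size=16: 6 full blocks cover 96 tokens; the last 4 tokens sit in a
# partial block and cannot be cached.
print(len(cacheable_block_hashes(prompt, 16)))   # 6

# block_size=1: every token is its own block, so there are 100 hashes to
# compute and 100 cache lookups -- far more management overhead per token.
print(len(cacheable_block_hashes(prompt, 1)))    # 100
```

In a real serving stack the per-block hash and lookup cost is paid on every request, which is one plausible reason APC feels slower at block_size=1 even though no tokens are lost to partial blocks.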