Hello everyone,
I’m currently working with vllm-rocm, a vLLM fork designed for AMD GPU devices.
I’m running DeepSeek V3 with the AITER MLA backend (AITER is an AMD kernel library that accelerates DeepSeek models on AMD GPUs through vLLM).
However, AITER only supports a block size of 1, and with this configuration, I’ve noticed a significant drop in APC performance.
My question is: Does the KV-cache block size impact APC performance?
In particular, should I expect degraded performance when using a small block size (e.g., 1)?
Thanks, and best regards
It seems likely that KV-cache block size does impact Automatic Prefix Caching (APC) performance. APC operates at block granularity: only full blocks are hashed and cached, so partial blocks are never reused, and with a very small block size (such as 1) repeated prompts may not benefit from cache reuse as efficiently. This is consistent with user reports that APC only becomes effective once the input length reaches or exceeds the block size, and with the fact that the AITER MLA backend currently supports only block_size=1 for DeepSeek models on ROCm, which may limit APC’s benefits relative to larger block sizes on other backends. That said, the available sources do not give a quantitative analysis of APC performance at block_size=1, so the exact impact in your scenario remains uncertain. See the discussion in vLLM issue #14686 and the AITER MLA backend PR.
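To make the block-granularity point concrete, here is a minimal, self-contained sketch of prefix caching at block granularity. This is not vLLM’s actual implementation; `cacheable_block_hashes` is a hypothetical helper, and the chained-hash scheme is only an approximation of how a prefix cache keys full blocks. It illustrates two effects of the block size: a trailing partial block is never cacheable, and with block_size=1 every token becomes its own block, multiplying the number of hashes to compute, store, and look up.

```python
from hashlib import sha256

def cacheable_block_hashes(token_ids, block_size):
    """Hash each *full* block of a token sequence, chaining in the previous
    block's hash so a block is only reusable when its entire prefix matches.
    The trailing partial block (if any) is skipped: it is not cacheable."""
    hashes = []
    prev = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        h = sha256(prev + repr(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

prompt = list(range(100))  # stand-in for a 100-token prompt

# block_size=16: 6 full blocks cover 96 tokens; the last 4 tokens sit in a
# partial block and cannot be cached.
print(len(cacheable_block_hashes(prompt, 16)))   # 6

# block_size=1: every token is its own block, so there are 100 hashes to
# compute and 100 cache lookups -- far more management overhead per token.
print(len(cacheable_block_hashes(prompt, 1)))    # 100
```

In a real serving stack the per-block hash and lookup cost is paid on every request, which is one plausible reason APC feels slower at block_size=1 even though no tokens are lost to partial blocks.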