Questions about PagedAttention v1

When I use the CPU backend for decoding (without the Attention provided by IPEX), it only supports BLOCK_SIZE=16 by default and does not support other sizes. Why is this the case? The version is v0.9.0; here is the relevant code:

```cpp
#define CALL_V1_KERNEL_LAUNCHER_BLOCK_SIZE(T)                     \
  switch (block_size) {                                           \
    case 16:                                                      \
      CALL_V1_KERNEL_LAUNCHER(T, 16);                             \
      break;                                                      \
    default:                                                      \
      TORCH_CHECK(false, "Unsupported block size: ", block_size); \
      break;                                                      \
  }
```

The CPU backend in vLLM (when not using Intel Extension for PyTorch, IPEX) supports only BLOCK_SIZE=16 because the custom CPU kernel explicitly restricts the block size in its dispatch code: any other value falls through to the `default` branch of the macro you quoted and raises an error. This is confirmed by both the code and user reports: the custom kernel raises an error for non-16 block sizes unless IPEX is used, which adds support for 128 for better performance on Intel CPUs. On non-Intel CPUs, only block size 16 is implemented in the kernel logic, so other sizes are unavailable by design in v0.9.0 and similar versions. See the discussion and code references in this PR and comments and the relevant source file.

Would you like a step-by-step breakdown of how this restriction is enforced in the code?

But I would like to know why it only supports BLOCK_SIZE=16? Can you help explain this?

vLLM's custom CPU kernel supports only BLOCK_SIZE=16 because its kernel implementation (e.g. attention.cpp) provides template instantiations and optimizations only for 16; no efficient code path is implemented for other block sizes. This simplifies the implementation, reduces maintenance cost, and guarantees performance and correctness. Only when using IPEX are larger block sizes such as 128 supported, because Intel's official extension provides additional optimized implementations. See the PR discussion and the source code for details.

Would you like a detailed code-level explanation?

Your explanation is not clear. I have extended the template instantiation to support BLOCK_SIZE=128 and implemented a new reduceQKBlockKernel, but it only produces a portion of the correct output (approximately 16 tokens); the rest is garbled, as shown below:

Prompt: 'The president of the United States is', Generated text: ' the head of state and government of the United States,. -s\n\n the the the schools'

The root cause is that the vLLM CPU kernel implementation (reduceQKBlockKernel and related logic) assumes BLOCK_SIZE=16 in many places: tiling, loops, and memory layout are all hard-coded or implicitly built around 16. Even after you extend the template instantiation and your own kernel, other related code (softmax normalization, per-block accumulation, writing outputs back, etc.) may still only handle 16, so with a larger block size some tokens are processed incorrectly or out of bounds, producing garbled output. See the source implementation for details.

Would you like a detailed analysis of which specific code paths or variables cause this problem?

For the places you mentioned, could you point to the code locations? Thanks.

Yes, in the vLLM CPU kernel (e.g. attention.cpp), the tiling, loops, and memory layout are indeed largely hard-coded around, or assume, BLOCK_SIZE=16. The specific code locations are:

  1. Template instantiation and tiling loops: in attention.cpp, #define CALL_V1_KERNEL_LAUNCHER_BLOCK_SIZE(T) instantiates only 16, and the related loops and tiling logic also use 16 directly.
  2. Softmax and normalization: the loop bounds and memory offsets for softmax normalization, QK accumulation, and similar operations use 16 directly; see the main kernel implementation in cpu/attention.cpp.
  3. Memory layout: the allocation, access, and write-back of the Q, K, and V caches assume a block size of 16 in their shapes and strides.

For details, see the attention.cpp source.

Would you like me to locate the exact implementation of a specific function or loop, line by line?
