When I use the CPU backend for decoding (without the attention implementation provided by IPEX), only BLOCK_SIZE=16 is supported by default, and no other sizes. Why is this the case? The version is v0.9.0, and here is the code:
The CPU backend in vLLM (when not using Intel Extension for PyTorch, IPEX) only supports BLOCK_SIZE=16 because the custom CPU kernel explicitly restricts the supported block size to 16: the kernel template is only instantiated for that value, and the macro you referenced raises an error for any other value. This matches both the code and user reports ("the custom kernel will raise a error" for non-16 block sizes). With IPEX, a block size of 128 is allowed for better performance on Intel CPUs; on the plain CPU backend, only block size 16 is implemented, so other sizes are unavailable by design in v0.9.0 and similar versions. See the discussion and code references in the relevant PR comments and source file.
Would you like a step-by-step breakdown of how this restriction is enforced in the code?
Your explanation is not clear. I have extended support for the BLOCK_SIZE=128 template instantiation and implemented a new reduceQKBlockKernel, but it still only produces correct output for a portion of the tokens (approximately the first 16), while the rest are garbled, as shown below:
Prompt: 'The president of the United States is', Generated text: ' the head of state and government of the United States,. -s\n\n the the the schools'
The root cause of this behavior is that the vLLM CPU kernel implementation (reduceQKBlockKernel and the surrounding logic) assumes BLOCK_SIZE=16 in many places: the data blocking, loop bounds, and memory layout are all hardcoded or implicitly based on 16. Even after you extend the template instantiation and the kernel itself, other related code (softmax normalization, per-block accumulation, output write-back, etc.) may still only handle 16, so with a larger block size some tokens are processed incorrectly or out of bounds, producing garbled output. See the source implementation for details.