Hi, I’m wondering why vLLM allocates KV-cache block tensors on a per-layer basis. I found this snippet from CacheEngine._allocate_kv_cache in vllm/worker/cache_engine.py, which creates a torch.Tensor for each attention layer. For LLMs where every layer has the same K/V dimensions, is there any drawback to merging the KV-cache blocks of all layers into one huge tensor?
for _ in range(self.num_attention_layers):
    # null block in CpuGpuBlockAllocator requires at least that
    # block to be zeroed-out.
    # We zero-out everything for simplicity.
    layer_kv_cache = torch.zeros(
        kv_cache_allocation_shape,
        dtype=self.dtype,
        pin_memory=pin_memory,
        device=device).permute(*kv_cache_stride_order)

    # view back to (TOTAL_PAGES, PAGE_SIZE, entry_shape...) for cases
    # when entry_shape is higher than 1D
    kv_cache.append(layer_kv_cache)
vLLM allocates KV-cache tensors per layer to support models with heterogeneous attention mechanisms (e.g., hybrid models with sliding-window, cross-attention, or Mamba layers), where each layer may have different KV-cache requirements. This design enables flexible memory management, per-layer cache policies, and compatibility with advanced features like prefix caching and hybrid allocators. For models where all layers have identical KV dimensions, merging into a single large tensor could theoretically reduce overhead, but it would complicate memory management, block allocation, and future extensibility for hybrid or non-uniform models, as discussed in the CacheEngine source and the Hybrid Memory Allocator RFC.
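For illustration, here is a rough sketch (not vLLM's actual code; all dimension names and sizes below are made-up placeholders) of what a merged allocation could look like when every layer really does share the same K/V shape. The per-layer tensors handed to the attention kernels would then just be views into one backing allocation:

import torch

# Hypothetical sketch, not vLLM's API: one contiguous allocation for all
# layers, sliced into per-layer views. The shapes (num_layers, num_blocks,
# block_size, num_kv_heads, head_size) are illustrative and assume every
# layer shares the same K/V dimensions and dtype.
num_layers, num_blocks, block_size = 32, 1024, 16
num_kv_heads, head_size = 8, 128

merged_kv_cache = torch.zeros(
    (num_layers, 2, num_blocks, block_size, num_kv_heads, head_size),
    dtype=torch.float16,
)  # the "2" dimension holds the K and V planes

# Per-layer views share storage with the big tensor, so attention code could
# still receive one tensor per layer, as it does today.
kv_cache = [merged_kv_cache[i] for i in range(num_layers)]
assert kv_cache[0].data_ptr() == merged_kv_cache.data_ptr()

The catch is that this only works when every layer's cache entry has an identical shape and dtype, which is exactly the assumption the per-layer design avoids baking in (sliding-window, cross-attention, and Mamba layers break it).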
Additionally, the current per-layer allocation allows vLLM to efficiently implement features like LRU eviction, group-based block tables, and per-layer cache sharing, which would be harder to manage with a monolithic tensor, especially as models become more complex (see docs). Would you like a step-by-step breakdown of the trade-offs or implementation details?
It seems likely that the block size limit of 32 tokens on CUDA is due to kernel and memory layout constraints, as well as performance tuning for current CUDA attention implementations. The vLLM documentation states that “on CUDA devices, only block sizes up to 32 are supported,” but does not provide a detailed technical rationale. There is some user discussion questioning this limit and noting that larger block sizes (e.g., 128) may work on newer hardware, but the official docs and code enforce 32 as the upper bound.
Would you like more detail on possible kernel or hardware reasons for this limit?
My understanding is that the 32-token block size limit for CUDA in vLLM is primarily a conservative choice based on kernel compatibility, memory layout, and ensuring correctness across a wide range of GPUs and attention backends. According to the vLLM documentation, only block sizes up to 32 are officially supported on CUDA devices. This is echoed in the CLI and engine argument docs, which list 1, 8, 16, and 32 as valid choices for CUDA, with larger sizes (64, 128) only for HPU or other platforms (see here).
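As a usage note (the model name below is only a placeholder), the block size is an ordinary engine argument, so selecting one of the documented CUDA values looks roughly like this, or --block-size 32 when launching the server:

from vllm import LLM

# block_size is an engine argument; per the docs, CUDA accepts 1, 8, 16, or 32.
# The model name is just a placeholder for illustration.
llm = LLM(model="facebook/opt-125m", block_size=32)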
From user discussions and issues, it seems that the limit is not strictly a hardware restriction, but rather a safeguard:
- The CUDA attention kernels in vLLM are tuned and tested for block sizes up to 32, which helps avoid shared-memory overflows and ensures efficient memory access patterns (a rough sizing sketch follows this list).
- Some users have reported success with larger block sizes on newer GPUs (e.g., Hopper), but this is not guaranteed to work on all CUDA devices or with all attention backends, especially since some kernels (like FlashAttention) may have their own minimum or maximum block size requirements (see discussion).
- The vLLM team appears to enforce this limit to avoid subtle bugs, fragmentation, or performance regressions that could arise from untested configurations.
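To make the memory-footprint point concrete, here is a back-of-envelope sizing sketch with made-up Llama-like dimensions (8 KV heads, head size 128, fp16); it only shows that the amount of K/V data tied to one block grows linearly with block_size, which is the working set a paged-attention-style kernel has to stage per block:

# Illustrative numbers only, not taken from vLLM.
num_kv_heads, head_size, dtype_bytes = 8, 128, 2  # fp16

def kv_bytes_per_block(block_size: int) -> int:
    # 2x for the separate K and V planes, per layer.
    return 2 * block_size * num_kv_heads * head_size * dtype_bytes

for bs in (8, 16, 32, 64, 128):
    print(f"block_size={bs:3d}: {kv_bytes_per_block(bs) // 1024} KiB of K/V per block per layer")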
In summary, the 32-token block size cap is a conservative, compatibility-driven limit that balances performance and reliability across CUDA GPUs, rather than a hard technical limit imposed by CUDA itself.