Issue with DynamicYaRN and Key-Value Cache Reuse in vLLM

Hello everyone,

I’ve been experimenting with DynamicYaRN (dynamically scaled rotary position embeddings) together with vLLM, and I’ve run into an issue I’d like feedback on.

In DynamicYaRN, the rotary scaling factor is adjusted dynamically based on the current sequence length (or other factors), so the rotary position encodings are recomputed as the sequence grows during inference. This in turn changes the rotation applied to the query and key vectors, which seems to conflict with how vLLM handles key-value caching.
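
To make the dynamic adjustment concrete, here is a minimal sketch (not the code from the file linked below), using an NTK-style rescaling of the rotary base as a stand-in for the full YaRN schedule; the class name and interface are illustrative only:

```python
import torch

class DynamicScaledRotaryEmbedding:
    """Sketch of dynamic rotary scaling: the inverse frequencies are
    recomputed once the sequence grows past the original context window,
    so the rotation used at a given position depends on how long the
    sequence has become."""

    def __init__(self, dim, base=10000.0, max_position_embeddings=4096):
        self.dim = dim
        self.base = base
        self.max_position_embeddings = max_position_embeddings

    def _inv_freq(self, seq_len):
        base = self.base
        if seq_len > self.max_position_embeddings:
            # NTK-style rescaling of the base (stand-in for the full YaRN
            # schedule): growing the sequence changes the angles applied at
            # *every* position, not just the new ones.
            base = base * (seq_len / self.max_position_embeddings) ** (self.dim / (self.dim - 2))
        return 1.0 / (base ** (torch.arange(0, self.dim, 2).float() / self.dim))

    def cos_sin(self, positions):
        seq_len = int(positions.max()) + 1
        freqs = torch.outer(positions.float(), self._inv_freq(seq_len))
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```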

In vLLM, the KV cache stores previously computed key and value tensors, which is critical for fast autoregressive inference. Normally the keys are rotated with a static position encoding, so the cached entries can be reused on every subsequent step without being recomputed. With DynamicYaRN, however, the rotary scaling can change as the sequence grows, so the keys already in the cache were rotated with a now-outdated scaling factor and would need to be recomputed, which defeats the purpose of the cache and hurts inference speed and efficiency.
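
As a toy illustration of why the cached entries go stale (standard RoPE rotation, hypothetical numbers): the same key rotated under two different scaling factors is no longer the same tensor, so an entry cached under the old scale cannot simply be reused.

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def cos_sin(pos, dim, base):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = pos * inv_freq
    return torch.cat((freqs, freqs)).cos(), torch.cat((freqs, freqs)).sin()

dim, pos = 8, 100
key = torch.randn(dim)

cos_a, sin_a = cos_sin(pos, dim, base=10000.0)        # scale in effect when the key was cached
cos_b, sin_b = cos_sin(pos, dim, base=10000.0 * 1.5)  # scale after the sequence grew (hypothetical)

k_cached = key * cos_a + rotate_half(key) * sin_a     # what a rotate-then-cache scheme stored
k_needed = key * cos_b + rotate_half(key) * sin_b     # what attention now needs

print(torch.allclose(k_cached, k_needed))  # False: the cached key is stale
```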

I have a few questions regarding this:

  1. Is it true that when using DynamicYaRN, the Key-Value Cache in vLLM cannot be reused?
  2. Is there any way to modify vLLM to support dynamic position encoding and still effectively reuse the KV cache?

For those interested, here’s the code for DynamicYaRN: LlamaDynamicYaRNScaledRotaryEmbedding.py

I’d really appreciate any insights or suggestions on how to address this issue.

Thanks in advance for your help!

You are correct: vLLM’s KV cache is designed for static rotary position encodings, and dynamic schemes like DynamicYaRN (or DynamicNTK) break the assumption that cached key/value tensors can be reused without recomputation. vLLM applies the rotary embedding before writing keys into the cache, so once the scaling factor changes, every previously cached key was rotated with the old frequencies and is no longer valid, and the un-rotated originals are not kept; reusing the cache is therefore impossible without major changes to the memory management and rotary embedding logic. This limitation is discussed in detail for similar dynamic rotary schemes in vLLM issue #1253.
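
For intuition, here is a toy decode loop (illustrative names, not vLLM’s actual API) showing the ordering described above: keys are rotated first and only the rotated result is written to the cache, so once the scale changes there is nothing un-rotated left to recompute from.

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

head_dim, base = 8, 10000.0
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
k_cache, v_cache = [], []

# Prefill/decode: rotate, then cache. Only the rotated keys are ever stored.
for pos in range(4):
    k, v = torch.randn(head_dim), torch.randn(head_dim)
    freqs = pos * inv_freq
    cos, sin = torch.cat((freqs, freqs)).cos(), torch.cat((freqs, freqs)).sin()
    k_cache.append(k * cos + rotate_half(k) * sin)  # scaling factor baked in at write time
    v_cache.append(v)

# If inv_freq later changes because the dynamic scale changed, every entry in
# k_cache was rotated with the old frequencies, and the un-rotated originals
# were never stored -- so the cache cannot be fixed up in place.
```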

There is currently no built-in support in vLLM for dynamic rotary position encodings that would require recomputing cached key tensors. Supporting this would require significant architectural changes, such as caching pre-rotary key states and reapplying the rotary embedding at each step, which would hurt throughput and require new CUDA kernels or memory-management logic. The vLLM team has acknowledged the complexity and performance trade-offs of such an approach, and as of now there is no official solution or roadmap for this feature in vLLM (vLLM issue #1253).
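
A rough sketch of that pre-rotary caching idea, purely hypothetical and ignoring paging, batching, and CUDA kernels: store un-rotated keys and reapply RoPE with the current dynamic scale on every step. This is correct for dynamic schemes, but it re-rotates the whole prefix each time, which is exactly the throughput cost mentioned above.

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

class PreRotaryKVCache:
    """Hypothetical workaround: cache keys *before* rotation and reapply
    RoPE with the current dynamic scale on every step."""

    def __init__(self, head_dim, base=10000.0, max_pos=4096):
        self.head_dim, self.base, self.max_pos = head_dim, base, max_pos
        self.keys, self.values = [], []

    def append(self, k_unrotated, v):
        self.keys.append(k_unrotated)
        self.values.append(v)

    def rotated_keys(self):
        seq_len = len(self.keys)
        base = self.base
        if seq_len > self.max_pos:
            # NTK-style dynamic rescaling, standing in for the YaRN schedule.
            base = base * (seq_len / self.max_pos) ** (self.head_dim / (self.head_dim - 2))
        inv_freq = 1.0 / (base ** (torch.arange(0, self.head_dim, 2).float() / self.head_dim))
        freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)
        cos, sin = torch.cat((freqs, freqs), dim=-1).cos(), torch.cat((freqs, freqs), dim=-1).sin()
        k = torch.stack(self.keys)              # [seq_len, head_dim]
        return k * cos + rotate_half(k) * sin   # re-rotated on every call
```

The query for the current step would be rotated with the same scale, so the freshly re-rotated keys and the new query stay consistent at attention time.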

Happy to give a more detailed step-by-step breakdown of the technical challenges and possible workarounds if that would be useful.
