Hello everyone,
I’ve been experimenting with DynamicYaRN (dynamically scaled rotary position embeddings) together with vLLM, and I’ve run into an issue I’d like to get feedback on.
In DynamicYaRN, the position encoding is scaled dynamically based on the current sequence length (or other factors). This means the rotary position encodings are recomputed during inference as the sequence grows, which in turn changes how the query and key vectors are rotated. This seems to conflict with how vLLM handles key-value caching.
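To make the concern concrete, here is a minimal, hypothetical sketch of the kind of dynamic scaling I mean. This is not the exact code from LlamaDynamicYaRNScaledRotaryEmbedding.py, just the general NTK/YaRN-style idea of rescaling the rotary base once the sequence outgrows the original context:

```python
# Hypothetical sketch of "dynamic" rotary scaling -- not the actual
# LlamaDynamicYaRNScaledRotaryEmbedding.py, just the general idea.
import torch


class DynamicScaledRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=4096, base=10000.0):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.cached_seq_len = 0
        self._set_tables(max_position_embeddings)

    def _set_tables(self, seq_len):
        # When the sequence outgrows the training context, rescale the base
        # (NTK/YaRN-style) before rebuilding the cos/sin tables.
        base = self.base
        if seq_len > self.max_position_embeddings:
            scale = seq_len / self.max_position_embeddings
            base = self.base * scale ** (self.dim / (self.dim - 2))
        inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float() / self.dim))
        t = torch.arange(seq_len).float()
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos(), persistent=False)
        self.register_buffer("sin_cached", emb.sin(), persistent=False)
        self.cached_seq_len = seq_len

    def forward(self, seq_len):
        # Every time the sequence grows past the cached length, the whole
        # cos/sin table changes -- positions 0..N-1 are re-encoded as well.
        if seq_len > self.cached_seq_len:
            self._set_tables(seq_len)
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```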
In vLLM, the KV cache stores previously computed key and value vectors, which is critical for fast autoregressive inference. Normally those cached entries are computed with a fixed position encoding, so they can simply be reused at later decoding steps instead of being recomputed for every past token. With DynamicYaRN, however, the position encoding changes as the sequence grows, so the keys already written to the cache were rotated with a now-outdated scaling factor. In principle they would all have to be recomputed, which defeats the purpose of the cache and hurts inference speed and efficiency.
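Here is a toy example of why I think the cached entries go stale. It assumes, as in vLLM's Llama attention path, that RoPE is applied to the keys before they are written to the KV cache (the helper names are made up for illustration):

```python
# Toy illustration of the staleness problem: a key rotated and cached under
# one rotary base no longer matches the "correct" encoding after the base
# is rescaled. Helper names are made up for this example.
import torch


def apply_rope(x, pos, dim, base):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = pos * inv_freq
    cos = torch.cat((freqs, freqs)).cos()
    sin = torch.cat((freqs, freqs)).sin()
    x1, x2 = x[: dim // 2], x[dim // 2:]
    rotated = torch.cat((-x2, x1))
    return x * cos + rotated * sin


dim = 8
k = torch.randn(dim)

# Step t: the key at position 5 is rotated with the current base and cached.
cached_k = apply_rope(k, pos=5, dim=dim, base=10000.0)

# Step t+n: the sequence has grown, DynamicYaRN rescales the base, and the
# correct encoding of position 5 is now different from what was cached.
fresh_k = apply_rope(k, pos=5, dim=dim, base=10000.0 * 1.5)

print(torch.allclose(cached_k, fresh_k))  # False -> the cached entry is stale
```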
I have a few questions regarding this:
- Is it true that, when using DynamicYaRN, the KV cache in vLLM cannot be reused?
- Is there any way to modify vLLM to support dynamic position encoding and still effectively reuse the KV cache? (A rough sketch of one idea follows below.)
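For the second question, the only direction I can think of so far is to cache the keys before RoPE is applied and rotate the whole cached prefix with the current cos/sin tables at attention time. As far as I know this is not something vLLM supports out of the box; the snippet below is just a sketch of the idea with made-up helper names:

```python
# Sketch of a possible workaround (NOT an existing vLLM feature): cache
# un-rotated keys and apply the *current* RoPE scaling to the whole cached
# prefix at attention time, so the cache stays valid across scaling changes.
import torch


def rope_rotate(x, positions, base, dim):
    # x: (seq, dim), positions: (seq,)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = positions[:, None].float() * inv_freq[None, :]
    cos = torch.cat((freqs, freqs), dim=-1).cos()
    sin = torch.cat((freqs, freqs), dim=-1).sin()
    x1, x2 = x[..., : dim // 2], x[..., dim // 2:]
    return x * cos + torch.cat((-x2, x1), dim=-1) * sin


def attend(q, unrotated_k_cache, v_cache, current_base):
    # q: (dim,) query for the newest position; caches: (seq, dim)
    seq, dim = unrotated_k_cache.shape
    positions = torch.arange(seq)
    # Re-apply RoPE to the whole cached prefix with the current scaling,
    # so every key is consistent with the query's encoding.
    k = rope_rotate(unrotated_k_cache, positions, current_base, dim)
    q = rope_rotate(q[None, :], positions[-1:], current_base, dim)[0]
    scores = torch.softmax(k @ q / dim ** 0.5, dim=0)
    return scores @ v_cache
```

The trade-off is obvious: the cache stays reusable across scaling changes, but every decoding step now pays for re-rotating all cached keys (or at least for re-rotating them whenever the scaling factor actually changes). I'd love to hear whether something like this is feasible inside vLLM's paged attention kernels, or whether there is a better approach.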
For those interested, here's the code for DynamicYaRN: LlamaDynamicYaRNScaledRotaryEmbedding.py
I’d really appreciate any insights or suggestions on how to address this issue.
Thanks in advance for your help!