How to crop kv_caches?

Lutalica · April 13, 2025, 2:44pm

I want to implement a custom feature in Qwen2ForCausalLM that crops a certain length of kv_caches (e.g. discards the last 64 generated tokens). How can I implement this?

I tried directly modifying the kv_caches passed into the forward() function, but found that the shape of kv_cache is fixed at [2, num_blocks, block_size, num_kv_heads, head_size]. I can use block_tables to fetch the kv_cache entries that correspond to a completion, and there must be some parameter that indicates where that completion's actual kv_cache ends, but I can't find it.
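To make the lookup described above concrete, here is a minimal sketch in plain Python. It assumes each completion has a per-sequence length (called `seq_len` here) next to its block table, so that logical token position `pos` lives in physical block `block_table[pos // block_size]` at offset `pos % block_size`. The function names `token_slot` and `crop_sequence` are hypothetical and are not part of vLLM; under this assumption, cropping amounts to shrinking the length and dropping trailing block ids rather than editing the cache tensor.

```python
from typing import List, Tuple


def token_slot(block_table: List[int], block_size: int, pos: int) -> Tuple[int, int]:
    """Map a logical token position to (physical block id, offset inside that block)."""
    return block_table[pos // block_size], pos % block_size


def crop_sequence(
    block_table: List[int], seq_len: int, block_size: int, num_drop: int
) -> Tuple[int, List[int], List[int]]:
    """Drop the last `num_drop` tokens of one sequence (hypothetical helper).

    Returns the new length, the block ids still needed, and the block ids
    that no longer hold any live token of this sequence.
    """
    new_len = max(seq_len - num_drop, 0)
    blocks_needed = -(-new_len // block_size)  # ceiling division
    return new_len, block_table[:blocks_needed], block_table[blocks_needed:]


# Example: 100 cached tokens, block_size 16, so 7 blocks are in use.
block_table = [7, 3, 9, 12, 5, 20, 1]
new_len, kept, released = crop_sequence(block_table, seq_len=100, block_size=16, num_drop=64)
last_slot = token_slot(kept, 16, new_len - 1)
print(new_len, kept, released, last_slot)
```

In this sketch the data left in the dropped slots is never cleared; it is simply no longer addressed once the length is reduced.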

Another problem: once I have extracted the correct kv_cache for one completion, will cropping it cause problems? It might affect another completion's kv_cache, since the two may share the same cuda_block.
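The sharing concern can be illustrated with a toy model, assuming shared blocks are tracked by a reference count, as is common in paged KV cache designs. `ToyBlockPool` below is hypothetical and is not vLLM's block manager. The point of the sketch is that a crop only releases this completion's reference to a block and never overwrites the block's contents, so another completion that still holds the block is unaffected.

```python
from collections import Counter
from typing import List


class ToyBlockPool:
    """Hypothetical reference-counted pool of physical KV cache blocks."""

    def __init__(self) -> None:
        self.ref_counts: Counter = Counter()
        self.free_blocks: List[int] = []

    def acquire(self, block_id: int) -> None:
        """Record that one more sequence uses this block."""
        self.ref_counts[block_id] += 1

    def release(self, block_id: int) -> None:
        """Drop one reference; the block is reusable only when nobody holds it."""
        self.ref_counts[block_id] -= 1
        if self.ref_counts[block_id] == 0:
            del self.ref_counts[block_id]
            self.free_blocks.append(block_id)


pool = ToyBlockPool()
table_a = [7, 3, 9]   # completion A
table_b = [7, 3, 12]  # completion B shares the prefix blocks 7 and 3
for block_id in table_a + table_b:
    pool.acquire(block_id)

# Crop A so that only its first block is still needed.
for block_id in table_a[1:]:
    pool.release(block_id)
table_a = table_a[:1]

print(pool.free_blocks, dict(pool.ref_counts))
```

Block 9 was private to A and becomes free, while block 3 stays allocated because B still references it. A case this sketch does not cover is a partially filled block that stays in A's table after the crop while also being shared: writing new tokens into it would need a private copy first.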
