I recently reviewed the v1 code and have some questions about v1/core/SpecializedMmanager
. Inside it, there’s a SlidingWindowManager
for sliding window attention. When calculating the blocks to be discarded, it uses the following two lines of code:
last_useful_token = num_computed_tokens - self.sliding_window + 1
last_useful_block = last_useful_token // self.block_size
However, in boundary cases, it seems that a block might be deleted prematurely. Here’s an example:
Suppose a request has already computed 63 tokens. The block size is 16, and the sliding_window
size is also 16. Currently, there are 4 blocks, storing tokens [1 - 16], [17 - 32], [33 - 48], and [49 - 63] respectively. At this moment, the sliding window should cover tokens [48 - 63], and the 48th token is in the 3rd block. Therefore, the function should keep the 3rd and 4th blocks and delete the 1st and 2nd blocks.
However, according to the code logic, it will calculate last_useful_token = 63 - 16 + 1 = 48
, last_useful_block = 48 // 16 = 3
. Then, following the subsequent for
loop for i in range(last_useful_block - 1, -1, -1)
, the blocks with indices [0, 1, 2]
in the list will all be deleted. That is, the 3rd block will also be removed. Won’t this lead to the deletion of the block where the 48th token is located?
I’m not clear about the logic of this part. Is it because the token to be computed in the current round is also included in the sliding window, so it’s assumed that they need to be shifted back by one position as a whole? Otherwise, in boundary cases, it seems that the block where the first token in the sliding window is located will be deleted.