Question about V1 code(v0.8.2)

flyingfish · April 28, 2025, 7:45am

I recently reviewed the v1 code and have some questions about v1/core/SpecializedMmanager. Inside it, there’s a SlidingWindowManager for sliding window attention. When calculating the blocks to be discarded, it uses the following two lines of code:

last_useful_token = num_computed_tokens - self.sliding_window + 1
last_useful_block = last_useful_token // self.block_size

However, in boundary cases, it seems that a block might be deleted prematurely. Here’s an example:

Suppose a request has already computed 63 tokens. The block size is 16, and the sliding_window size is also 16. Currently, there are 4 blocks, storing tokens [1 - 16], [17 - 32], [33 - 48], and [49 - 63] respectively. At this moment, the sliding window should cover tokens [48 - 63], and the 48th token is in the 3rd block. Therefore, the function should keep the 3rd and 4th blocks and delete the 1st and 2nd blocks.

However, according to the code logic, it will calculate last_useful_token = 63 - 16 + 1 = 48, last_useful_block = 48 // 16 = 3. Then, following the subsequent for loop for i in range(last_useful_block - 1, -1, -1), the blocks with indices [0, 1, 2] in the list will all be deleted. That is, the 3rd block will also be removed. Won’t this lead to the deletion of the block where the 48th token is located?

I’m not clear about the logic of this part. Is it because the token to be computed in the current round is also included in the sliding window, so it’s assumed that they need to be shifted back by one position as a whole? Otherwise, in boundary cases, it seems that the block where the first token in the sliding window is located will be deleted.

RunLLM · April 28, 2025, 7:45am

The logic in the SlidingWindowManager for determining which blocks to discard is indeed based on the assumption that the current token being computed is included in the sliding window. This means that the sliding window is effectively shifted to include the token currently being processed, which is why the calculation for last_useful_token includes a +1 adjustment. This adjustment ensures that the sliding window covers the current token, preventing premature deletion of blocks that contain tokens still needed for computation.

In your example, the calculation of last_useful_token = 63 - 16 + 1 = 48 and last_useful_block = 48 // 16 = 3 is correct under this logic. The sliding window is intended to cover tokens [48 - 63], and the +1 ensures that the block containing the 48th token is not prematurely discarded. The loop for i in range(last_useful_block - 1, -1, -1) is designed to remove blocks that are no longer needed, starting from the block before last_useful_block. This logic should correctly retain the necessary blocks for the sliding window, including the block containing the 48th token.

If you would like more detail or further clarification, please let me know.

Sources:

vLLM Specialized Manager

flyingfish · April 28, 2025, 7:53am

In the context of chunk prefill, we usually compute multiple tokens at once. Given the current logic in the SlidingWindowManager for calculating the blocks to be discarded, I’m concerned that it might lead to errors when more than one token is computed in each new step. Just like in the previous boundary - case analysis, when dealing with multiple newly - computed tokens, the calculation of last_useful_token and last_useful_block may not accurately determine which blocks should be retained or discarded. Could you please clarify if this code is designed to handle such scenarios correctly?

RunLLM · April 28, 2025, 7:53am

The current logic in the SlidingWindowManager is primarily designed to handle scenarios where tokens are computed one at a time. When multiple tokens are computed in a single step, as is the case with chunk prefill, the logic for determining which blocks to discard may not be directly applicable without adjustments. The calculation of last_useful_token and last_useful_block assumes a single token computation, which could lead to inaccuracies when multiple tokens are involved.

In scenarios where multiple tokens are computed at once, the sliding window should be adjusted to account for the entire range of newly computed tokens. This means that the logic for determining last_useful_token should consider the range of tokens computed in the current step, rather than just the last token. Similarly, last_useful_block should be calculated based on the entire range of tokens that need to be retained within the sliding window.

If the current implementation does not account for this, it may require modifications to ensure that the sliding window logic correctly handles multiple token computations. This could involve adjusting the calculation of last_useful_token to cover the full range of newly computed tokens and ensuring that the loop for discarding blocks only removes those that are entirely outside the sliding window.

If you need more detailed guidance on how to implement these changes, please let me know.

Sources:

Topic		Replies	Views
Why not use subtraction just for total_num_scheduled_tokens calculate in cheduler? General	3	12	May 17, 2025
为什么在v1架构中tokenier使用线程池而不使用进程池 General	3	63	May 26, 2025
Vllm v1 在哪个地方判读结束符或者停止输出的？ General	1	16	June 21, 2025
Why remove bonus token of requset in draft model? Speculative Decoding	0	30	March 30, 2025
DP并行时在set_forward_context中做all_reduce作用是什么 General	17	75	May 30, 2025

Question about V1 code(v0.8.2)

Related topics