Under what circumstances is a request scheduled more than once?

RunLLM · January 2, 2026, 2:58pm

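Yes. The earlier chunks do not produce any generated output directly. Their results (mainly the KV cache) are used only for inference on later chunks, and whatever the earlier chunks output is discarded rather than returned to the user. See the official performance notes and the related discussion.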
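The behavior described above can be pictured with a toy loop. This is a minimal sketch and not vLLM's scheduler: `toy_chunked_prefill`, its arguments, and its return values are hypothetical names chosen for illustration, and a plain list stands in for the KV cache. It shows each chunk taking one scheduling step and adding to the cache, while only the final chunk yields a position whose output is kept.

```python
def toy_chunked_prefill(prompt_token_ids, chunk_size):
    """Toy model of chunked prefill (hypothetical helper, not vLLM code)."""
    kv_cache = []               # stands in for the per-token KV entries
    kept_output_positions = []  # prompt positions whose output is not discarded
    num_steps = 0               # how many times the request was scheduled

    for start in range(0, len(prompt_token_ids), chunk_size):
        chunk = prompt_token_ids[start:start + chunk_size]
        num_steps += 1
        # Every chunk contributes KV entries that later chunks attend to.
        kv_cache.extend(chunk)
        # Only once the whole prompt is cached is an output kept for sampling.
        if len(kv_cache) == len(prompt_token_ids):
            kept_output_positions.append(len(kv_cache) - 1)

    return kv_cache, kept_output_positions, num_steps


kv_cache, kept, steps = toy_chunked_prefill(list(range(10)), chunk_size=4)
print(steps, kept)
```

In this toy model a 10-token prompt with a chunk size of 4 is scheduled three times, which is one way a single request ends up being scheduled repeatedly.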

Would you like a detailed explanation of how chunked prefill is scheduled and how its output is handled?

Sources:

https://github.com/vllm-project/vllm/issues/7330

Hint: Mention @RunLLM in the post for followups.

yangshanjun · January 3, 2026, 1:17pm

@RunLLM Does chunked prefill use a mask during computation? How is the mask set for the later chunks?

RunLLM · January 3, 2026, 1:18pm

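In chunked prefill, each chunk's attention mask is set dynamically according to the chunk's position within the full prompt. The first chunk uses a plain causal mask (each token attends only to itself and earlier tokens). Later chunks must also attend to all preceding tokens, so the mask is extended such that every token still sees only itself and all tokens before it, which preserves the autoregressive property. See the official API implementation notes.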

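Concretely, the attention mask of a later chunk covers all tokens that have already been prefilled, usually by concatenating or extending the mask matrix. In PyTorch, for example, this can be done with causal masking plus a mask whose shape is adjusted per chunk (the relevant flag on `torch.nn.functional.scaled_dot_product_attention` is `is_causal`; FlashAttention-style kernels call it `causal`).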
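The mask shape for a later chunk can be sketched in plain Python. This is an illustration under stated assumptions, not vLLM's implementation: `chunk_attention_mask` is a hypothetical helper, and real attention kernels typically apply this rule implicitly rather than materializing a matrix. Each row is a query token in the current chunk, each column is a key position in the prompt so far, and `True` means the query may attend to that key.

```python
def chunk_attention_mask(num_cached_tokens, chunk_len):
    """Boolean mask of shape [chunk_len][num_cached_tokens + chunk_len].

    Hypothetical helper: query i of the chunk sits at global position
    num_cached_tokens + i and may attend to every key at or before it.
    """
    total_keys = num_cached_tokens + chunk_len
    return [
        [key_pos <= num_cached_tokens + query_idx for key_pos in range(total_keys)]
        for query_idx in range(chunk_len)
    ]


# First chunk: nothing cached yet, so this is the usual lower-triangular mask.
first_chunk_mask = chunk_attention_mask(num_cached_tokens=0, chunk_len=3)

# Second chunk: 3 tokens already prefilled, 2 new query tokens.
second_chunk_mask = chunk_attention_mask(num_cached_tokens=3, chunk_len=2)
for row in second_chunk_mask:
    print(["x" if allowed else "." for allowed in row])
```

The second chunk's mask is rectangular: its left block (the cached tokens) is fully visible, and only the right block over the chunk itself is triangular.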

Would you like a detailed code example or a walkthrough of the source?

Sources:

https://docs.vllm.ai/en/stable/api/vllm/vllm.v1.attention.backends.mla.common.html#chunked-prefill


yangshanjun · January 3, 2026, 1:19pm

@RunLLM How do I turn off chunked prefill?

RunLLM · January 3, 2026, 1:20pm

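In the vLLM V1 engine, chunked prefill is always on by default and cannot be disabled with a parameter. Even if you pass --no-enable-chunked-prefill, the log still shows chunked prefill as enabled. The official documentation and community discussion state explicitly that V1 does not support disabling this feature. To turn it off, the only option is to switch back to the V0 engine (for example by setting the environment variable with export VLLM_USE_V1=0).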

Would you like details on switching engines or the related parameters?

Sources:

https://github.com/vllm-project/vllm/issues/20914


yangshanjun · January 4, 2026, 9:02am

@RunLLM

    # Cache the sampled tokens in the model runner, so that the scheduler
    # doesn't need to send them back.
    # NOTE(woosuk): As an exception, when using PP, the scheduler sends
    # the sampled tokens back, because there's no direct communication
    # between the first-stage worker and the last-stage worker.

How should this comment be understood?

RunLLM · January 4, 2026, 9:03am

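This comment means the following: in the ordinary case without pipeline parallelism (PP), sampled tokens are cached directly in the model runner, so the scheduler does not need to send them back, which saves communication and serialization overhead. Under PP, however, there is no direct communication channel between the first-stage worker and the last-stage worker, so sampled tokens can only be relayed through the scheduler. The scheduler therefore has to send the sampled tokens back to the model runner. See the official PR description and the source implementation.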

In short: without PP the tokens are cached locally; with PP the tokens are forwarded through the scheduler.
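The relay described in the quoted comment can be mimicked with a small simulation. Everything here is an assumption made for illustration: `ToyWorker`, `sample`, and `receive_from_scheduler` are hypothetical names, the token values are made up, and the loop plays the role of the scheduler. It is a sketch of the idea only, not how vLLM structures its workers.

```python
class ToyWorker:
    """Hypothetical stand-in for one pipeline stage's model runner."""

    def __init__(self, is_last_rank):
        self.is_last_rank = is_last_rank
        self.token_ids = []  # this stage's local view of the request's tokens

    def sample(self, token_id):
        # Only the last stage samples, and it caches the result locally.
        assert self.is_last_rank, "only the last stage samples tokens"
        self.token_ids.append(token_id)
        return token_id

    def receive_from_scheduler(self, new_token_ids):
        # The last stage already cached these tokens when it sampled them.
        # Other stages only learn about them from the scheduler.
        if not self.is_last_rank:
            self.token_ids.extend(new_token_ids)


first_stage = ToyWorker(is_last_rank=False)
last_stage = ToyWorker(is_last_rank=True)

# The loop acts as the scheduler: it collects each sampled token from the
# last stage and sends it back to every stage on the next step.
for made_up_token in (11, 22, 33):
    sampled = last_stage.sample(made_up_token)
    for worker in (first_stage, last_stage):
        worker.receive_from_scheduler([sampled])

print(first_stage.token_ids, last_stage.token_ids)
```

Both stages end with the same token list, and the last stage never stores a token twice because it ignores the copy relayed by the scheduler.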


yangshanjun · January 4, 2026, 9:23am

@RunLLM

        # For the last rank, we don't need to update the token_ids_cpu
        # because the sampled tokens are already cached.
        if not is_last_rank:
            # Add new_token_ids to token_ids_cpu.
            start_token_index = num_computed_tokens
            end_token_index = num_computed_tokens + len(new_token_ids)
            self.input_batch.token_ids_cpu[
                req_index, start_token_index:end_token_index
            ] = new_token_ids
            self.input_batch.num_tokens_no_spec[req_index] = end_token_index

Why does input_batch.token_ids_cpu need to be updated?

RunLLM · January 4, 2026, 9:24am

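input_batch.token_ids_cpu needs to be updated because on a non-last stage (a rank that is not the last rank) the newly sampled tokens have to be written into the CPU-side token sequence so that later inference steps or other stages can use them. This keeps the token sequence complete and consistent, and supports data synchronization under pipeline parallelism (PP) or multi-stage inference. See the source implementation and its comments, and the official API documentation.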

In short, token_ids_cpu is updated so that later stages can read the latest token sequence correctly.


yangshanjun · January 4, 2026, 9:25am

@RunLLM But don't the middle ranks produce no new_token_ids at all?

RunLLM · January 4, 2026, 9:26am

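Yes. Under pipeline parallelism (PP), the middle ranks (neither first nor last) do not produce new_token_ids. Only the last rank samples and generates new tokens. The earlier ranks only run the forward pass and pass along intermediate state, so they output no new_token_ids. See the source implementation and its comments.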

Would you like a detailed explanation of each PP rank's responsibilities or the data flow?

Sources:

https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py


yangshanjun · January 4, 2026, 9:26am

yangshanjun:

    if not is_last_rank:
        # Add new_token_ids to token_ids_cpu.
        start_token_index = num_computed_tokens
        end_token_index = num_computed_tokens + len(new_token_ids)
        self.input_batch.token_ids_cpu[
            req_index, start_token_index:end_token_index
        ] = new_token_ids
        self.input_batch.num_tokens_no_spec[req_index] = end_token_index

@RunLLM Then with this code, how do the middle ranks get updated?

RunLLM · January 4, 2026, 9:27am

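Under pipeline parallelism (PP), a middle rank (neither the first rank nor the last rank) does not actually produce new_token_ids, so this code is not executed on middle ranks. Only a non-last rank (usually the first rank) updates token_ids_cpu when needed, for passing data to later stages. Middle ranks only forward intermediate state; they neither sample new tokens nor update token_ids_cpu. See the source implementation and its comments.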

If you want to know more about the data flow on middle ranks or how the KV cache is passed along, feel free to ask.

Sources:

https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py


yangshanjun · January 4, 2026, 9:28am

@RunLLM But isn't the condition if not is_last_rank? That means every rank other than the last one enters this branch.

RunLLM · January 4, 2026, 9:28am

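You are right. The condition is if not is_last_rank, so every rank that is not the last one (the first rank and all middle ranks) enters this code. In practice, though, only the first rank has new_token_ids to write. The middle ranks sample no new tokens, so their new_token_ids is usually empty or never assigned, and although the code runs, no data is actually written. See the source implementation and the PR discussion.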

If you need a more detailed explanation of the data flow between ranks or the variable assignment logic, feel free to ask.

Sources:

https://github.com/vllm-project/vllm/pull/20291


yangshanjun · January 4, 2026, 9:31am

@RunLLM If new_token_ids is empty, wouldn't indexing it with [-1] be a problem?

RunLLM · January 4, 2026, 9:31am

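Yes. If new_token_ids is empty, accessing new_token_ids[-1] directly raises an IndexError. The source usually checks whether new_token_ids is empty first and accesses [-1] only when len(new_token_ids) > 0, which avoids the error. See the source implementation.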
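The guard mentioned above is ordinary Python and can be shown in isolation. This is a generic sketch: `last_token_or_none` is a hypothetical helper written for this example and is not taken from vLLM's source.

```python
def last_token_or_none(new_token_ids):
    """Return the final token id, or None when the list is empty.

    Hypothetical helper illustrating the emptiness check before [-1].
    """
    if not new_token_ids:
        return None
    return new_token_ids[-1]


# Without the guard, indexing an empty list fails.
try:
    [][-1]
    unguarded_failed = False
except IndexError:
    unguarded_failed = True

print(last_token_or_none([]), last_token_or_none([5, 9]), unguarded_failed)
```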

Would you like a detailed example of the guard code or an explanation of the exception handling?

Sources:

https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py


yangshanjun · January 4, 2026, 1:46pm

@RunLLM Under PP, how is token_ids_cpu updated?

RunLLM · January 4, 2026, 1:46pm

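Under pipeline parallelism (PP), only ranks that are not the last one (not is_last_rank) update token_ids_cpu. Specifically, the newly sampled tokens (new_token_ids) are written into the corresponding positions of input_batch.token_ids_cpu, over the range [start_token_index:end_token_index], and metadata such as num_tokens_no_spec is updated to match. The last rank does not need to update token_ids_cpu because the sampled tokens are already cached. See the source implementation and the official API documentation.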

The logic is roughly as follows (for reference only):

    if not is_last_rank:
        start_token_index = num_computed_tokens
        end_token_index = num_computed_tokens + len(new_token_ids)
        self.input_batch.token_ids_cpu[req_index, start_token_index:end_token_index] = new_token_ids
        self.input_batch.num_tokens_no_spec[req_index] = end_token_index
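To see the index arithmetic run end to end, the same update can be reproduced on plain Python lists. This is a standalone sketch under stated assumptions: `write_new_tokens` is a hypothetical function, nested lists replace the real `token_ids_cpu` array, and the buffer contents are made up. It is not the vLLM method itself.

```python
def write_new_tokens(token_ids_cpu, num_tokens_no_spec, req_index,
                     num_computed_tokens, new_token_ids, is_last_rank):
    """Mirror of the quoted update, using nested lists (hypothetical helper)."""
    if is_last_rank:
        # The last rank already cached its sampled tokens; nothing to do.
        return
    start_token_index = num_computed_tokens
    end_token_index = num_computed_tokens + len(new_token_ids)
    # Same-length slice assignment overwrites the placeholder slots in place.
    token_ids_cpu[req_index][start_token_index:end_token_index] = new_token_ids
    num_tokens_no_spec[req_index] = end_token_index


# One request with 3 tokens already computed and 3 free slots (zeros).
token_ids_cpu = [[1, 2, 3, 0, 0, 0]]
num_tokens_no_spec = [3]

write_new_tokens(token_ids_cpu, num_tokens_no_spec, req_index=0,
                 num_computed_tokens=3, new_token_ids=[7], is_last_rank=False)
print(token_ids_cpu, num_tokens_no_spec)

# With an empty new_token_ids the slice is empty, so the buffer is unchanged.
write_new_tokens(token_ids_cpu, num_tokens_no_spec, req_index=0,
                 num_computed_tokens=4, new_token_ids=[], is_last_rank=False)
```

In this list-based version an empty `new_token_ids` turns the write into a no-op, which matches the earlier remark that the branch can run without writing any data.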

Would you like a more detailed explanation of the data flow or of each PP rank's responsibilities?


yangshanjun · January 4, 2026, 1:48pm

@RunLLM Why do the non-last ranks need to update token_ids_cpu?
