What does Skip_leading_tokens mean?

yangshanjun · February 13, 2026, 2:45am

@RunLLM But isn't the copy in prepare_input_ids asynchronous? There is no guarantee that the data has already been copied to the GPU.

self.input_ids[:total_num_scheduled_tokens].copy_(
    self.input_ids_cpu[:total_num_scheduled_tokens], non_blocking=True)

RunLLM · February 13, 2026, 2:46am

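You're right: _prepare_input_ids does use non_blocking=True for an asynchronous copy (e.g. .copy_(..., non_blocking=True)), so the call can return before the data has reached the GPU. In the regular synchronous inference path of PyTorch and vLLM this is still safe, because the copy and the model forward pass are issued on the same CUDA stream, and work on one stream executes in the order it was issued. The forward-pass kernels that read self.input_ids therefore start only after the copy has finished on the GPU. The CPU thread does not have to wait for this; it blocks only when it later needs a result back on the host. See the source linked below.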
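A minimal sketch of the pattern under discussion, assuming PyTorch is installed. The buffer names mirror the snippet quoted above, but the sizes and the stand-in "forward" operation are hypothetical and are not taken from vLLM. The host-to-device copy and the operation that follows are both issued on the current stream, so the GPU runs them in order even though the copy call returns to the CPU immediately.

```python
import torch

# Fall back to CPU so the sketch still runs without a GPU
# (on CPU the copy is simply synchronous).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

max_num_tokens = 1024            # hypothetical buffer size
total_num_scheduled_tokens = 16  # hypothetical value for this step

# Pinned (page-locked) host memory is what allows non_blocking=True
# to be truly asynchronous; pinning requires CUDA.
input_ids_cpu = torch.zeros(max_num_tokens, dtype=torch.int32,
                            pin_memory=use_cuda)
input_ids = torch.zeros(max_num_tokens, dtype=torch.int32, device=device)

input_ids_cpu[:total_num_scheduled_tokens] = torch.arange(
    total_num_scheduled_tokens, dtype=torch.int32)

# Enqueue the host-to-device copy on the current stream.
input_ids[:total_num_scheduled_tokens].copy_(
    input_ids_cpu[:total_num_scheduled_tokens], non_blocking=True)

# Stand-in for the model forward: a kernel enqueued on the same stream,
# so it runs after the copy without any explicit synchronization.
doubled = input_ids[:total_num_scheduled_tokens] * 2

# Copying the result back to the host is what finally blocks the CPU.
result = doubled.cpu()
```

The one thing the host must not do in this pattern is overwrite input_ids_cpu before the copy has actually run, since the GPU reads the pinned buffer at execution time, not at enqueue time.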

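If you are using custom asynchronous scheduling or multiple streams, you do need to synchronize explicitly (for example with torch.cuda.synchronize()) to make sure the data is safe to read. In vLLM's main path, however, the data is guaranteed to be in place before the forward pass reads it.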
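For the multi-stream case, here is a hedged sketch of explicit synchronization, assuming PyTorch is installed. The side stream copy_stream and the tensor names are hypothetical and are not part of vLLM. It shows two options: a stream-level wait that does not block the CPU, and the coarser torch.cuda.synchronize() mentioned above.

```python
import torch

n = 16  # hypothetical number of tokens
src_cpu = torch.arange(n, dtype=torch.int32)

if torch.cuda.is_available():
    src_cpu = src_cpu.pin_memory()  # needed for a truly async copy
    dst = torch.zeros(n, dtype=torch.int32, device="cuda")
    copy_stream = torch.cuda.Stream()  # hypothetical side stream

    # The zero-fill of dst ran on the current stream; make the side
    # stream wait for it before overwriting dst.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        dst.copy_(src_cpu, non_blocking=True)

    # Option 1: the compute stream waits for the copy stream.
    # Only the GPU-side ordering changes; the CPU does not block.
    torch.cuda.current_stream().wait_stream(copy_stream)
    total = dst.sum()

    # Option 2: block the CPU until all streams on the device are idle.
    torch.cuda.synchronize()
else:
    # CPU fallback: there are no streams, so nothing to synchronize.
    dst = src_cpu.clone()
    total = dst.sum()

total_value = int(total)
```

Without the wait_stream call (or an equivalent event), dst.sum() on the compute stream could run before the copy on copy_stream has finished and read stale zeros.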

Would you like a more detailed explanation of how CUDA stream synchronization works, or a code example?

Sources:

https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py

Hint: Mention @RunLLM in the post for followups.
