vLLM calls this function in `execute_model`:
generator.set_offset(generator.get_offset() - 4)
The call `generator.set_offset(generator.get_offset() - 4)` is used in vLLM to rewind the state of the random number generator associated with a request when a sampled token must be discarded, for example during partial (chunked) prefill. Rewinding restores the generator to the state it would have had if the token had never been sampled, which preserves reproducibility and correct sampling behavior in subsequent steps. The same logic appears in both the GPU and TPU model runners (gpu_model_runner.py and tpu_model_runner.py), in the paths that handle requests that should not advance because prefill is not yet complete.
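To illustrate the rewind mechanics, here is a minimal sketch using a toy counter-based generator. `CounterRNG` is hypothetical, not part of vLLM or PyTorch; the real code operates on a `torch.Generator`, where (per the `- 4` in the call above) one sampled token advances the underlying offset by 4. The key property being modeled is that the generator's output is a deterministic function of its state, so restoring the offset reproduces the exact same draw:

```python
# Conceptual sketch (hypothetical class, not vLLM's actual code): a
# counter-based generator whose entire state is (seed, offset), mimicking
# how a Philox-style RNG works on CUDA.

class CounterRNG:
    """Toy stand-in for torch.Generator's get_offset()/set_offset()."""

    def __init__(self, seed: int = 0) -> None:
        self.seed = seed
        self.offset = 0

    def get_offset(self) -> int:
        return self.offset

    def set_offset(self, offset: int) -> None:
        self.offset = offset

    def sample(self) -> int:
        # Deterministic function of (seed, offset): same state -> same draw.
        value = hash((self.seed, self.offset)) % 1000
        self.offset += 4  # one sampled token advances the offset by 4
        return value


gen = CounterRNG(seed=42)
first = gen.sample()                  # token we decide to discard
gen.set_offset(gen.get_offset() - 4)  # rewind: as if it was never sampled
second = gen.sample()                 # reproduces the discarded draw exactly
assert first == second
```

Without the rewind, the discarded draw would still have consumed RNG state, and a re-run of the same request would diverge from the original sampling sequence.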
If you need more detail on how this interacts with the rest of the sampling or scheduling logic, let me know.