I was looking at the V1 code and noticed that chunked prefill is always enabled, even if the user explicitly unsets it on the command line. I was able to track the source of this to the following PR [1]. I understand the PR's logic of not raising an error whenever chunked prefill is set, but I don't quite follow why chunked prefill should always be enabled in V1.
In vLLM V1, chunked prefill is always enabled by design and cannot be disabled, even if the user tries to unset it via command-line flags. The V1 scheduler architecture is built around chunked prefill as a core mechanism for batching and scheduling: large prefills are split into chunks and interleaved with decode requests, which improves throughput and latency. The PR you referenced (#11061) enforces this by always setting enable_chunked_prefill = True for V1. Later PRs and documentation confirm that disabling chunked prefill in V1 is not supported and that an attempt to do so raises an error.
This design choice is documented in the official vLLM optimization guides, which state: “In vLLM V1, chunked prefill is always enabled by default. This is different from vLLM V0, where it was conditionally enabled based on model characteristics” (Optimization and Tuning). The V1 scheduler does not distinguish between prefill and decode stages, and chunked prefill is considered a first-class, non-optional feature (discussion). Attempts to disable it in V1 will result in an explicit error (PR #21645).
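To make the mechanism concrete, here is a minimal toy sketch of token-budget scheduling with chunked prefill. It is not vLLM's scheduler: the names `Request`, `schedule_step`, `apply_step`, `run`, and `token_budget` are hypothetical, and the "decoding requests go first" ordering is an assumption of this sketch. It only illustrates the idea described above: each step has a fixed token budget, every request is just "some tokens left to compute", and a long prompt is split across steps instead of blocking shorter requests.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Toy request: a prompt of `prompt_len` tokens, then decode steps."""
    name: str
    prompt_len: int
    max_new_tokens: int
    num_computed: int = 0   # prompt tokens processed so far
    num_generated: int = 0  # output tokens produced so far

    @property
    def finished(self) -> bool:
        return self.num_generated >= self.max_new_tokens


def schedule_step(requests, token_budget):
    """Decide how many tokens each request processes in this step.

    There is no separate prefill or decode phase: a request still in its
    prompt may receive only a chunk of it, and a decoding request needs
    exactly one token.
    """
    plan = {}
    # Assumption of this sketch: requests that are already decoding are
    # served first, so a long prompt cannot starve them.
    ordered = sorted(requests, key=lambda r: r.num_computed < r.prompt_len)
    for req in ordered:
        if req.finished or token_budget == 0:
            continue
        remaining_prompt = req.prompt_len - req.num_computed
        wanted = remaining_prompt if remaining_prompt > 0 else 1
        granted = min(wanted, token_budget)  # this is the "chunk"
        plan[req.name] = granted
        token_budget -= granted
    return plan


def apply_step(requests, plan):
    """Advance each scheduled request by the number of tokens granted."""
    for req in requests:
        granted = plan.get(req.name, 0)
        if granted == 0:
            continue
        if req.num_computed < req.prompt_len:
            req.num_computed += granted
            # Completing the prompt produces the first output token.
            if req.num_computed == req.prompt_len:
                req.num_generated += 1
        else:
            req.num_generated += 1


def run(requests, token_budget):
    """Run steps until all requests finish; return the per-step plans."""
    if token_budget < 1:
        raise ValueError("token_budget must be at least 1")
    history = []
    while not all(r.finished for r in requests):
        plan = schedule_step(requests, token_budget)
        apply_step(requests, plan)
        history.append(plan)
    return history


if __name__ == "__main__":
    reqs = [Request("chat", prompt_len=3, max_new_tokens=3),
            Request("doc", prompt_len=10, max_new_tokens=2)]
    for step, plan in enumerate(run(reqs, token_budget=4), start=1):
        print(step, plan)
```

In this toy run, the 10-token "doc" prompt is processed in chunks of 1, 3, 3, and 3 tokens while "chat" keeps decoding in the same steps. Without chunking, a prompt longer than the per-step budget could never be scheduled at all, which gives some intuition for why a scheduler built this way treats chunking as mandatory.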
Would you like more detail on the technical reasons or scheduler design?
I see, so the V1 scheduler assumes chunked prefill will always yield better results for the relevant setups and makes it mandatory for performance reasons.
It also seems that bypassing this would violate scheduler assertions and possibly lead to bugs.
Got it, thanks.
Correct. V1's scheduler is fundamentally designed around chunked prefill, and disabling it would break core scheduling logic, likely causing assertion failures or undefined behavior (PR #11061, Optimization and Tuning). The architecture assumes chunked prefill for both correctness and performance, so bypassing it is unsupported and unsafe.
Let me know if you want a deeper dive into the technical details or scheduler internals.