Requesting review for PR #26625 (Hybrid Chunked Prefill)

Ther-LF · November 27, 2025, 5:49am

Summary of Changes:
Implements an adaptive scheduling mechanism that switches between continuous and chunked prefill. It removes the throughput penalty caused by unconditional chunking when decode traffic is low.

Performance Gains:
Benchmarks on QwQ-32B show a 2-5% increase in total token throughput and reduced Time to First Token (TTFT) in low-concurrency scenarios, while maintaining the inter-token latency (ITL) benefits of chunking.

Why I am requesting help:
The PR has been open for over a month. While @hmellor has reviewed the config parts and @njhill started the review, the process has stalled due to reviewer availability (sick leave).

Readiness:

Conflicts resolved and branch rebased onto latest main.
DCO/CI checks passed.
All previous feedback addressed.

I would appreciate it if a core maintainer (e.g., @WoosukKwon or @simon-mo) could take a brief look to unblock this optimization.

RunLLM · November 27, 2025, 5:50am

The PR introduces a hybrid chunked prefill mechanism that adaptively switches between continuous and chunked prefill based on decode activity. This eliminates the throughput penalty of always-on chunking, especially when decode traffic is low, and recovers baseline throughput while preserving inter-token latency (ITL) benefits when needed. Benchmarks on QwQ-32B show a 2–5% increase in total token throughput and 10–20% lower TTFT at low concurrency, with stable scaling up to concurrency=8 and no latency degradation. All previous feedback has been addressed, conflicts resolved, and CI/DCO checks passed, but the PR has stalled due to reviewer availability and is ready for core maintainer review to unblock merging, as detailed in the PR discussion.
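To make the switching idea concrete, here is a minimal sketch of how a per-step prefill budget could be chosen from decode activity. It is not the PR's code and not a vLLM API: `SchedulerState`, `prefill_token_budget`, `chunk_size`, and `decode_threshold` are hypothetical names and values picked for illustration only, and the real scheduler may use a different signal or policy.

```python
from dataclasses import dataclass


@dataclass
class SchedulerState:
    """Hypothetical snapshot of scheduler load (illustrative, not a vLLM type)."""
    num_running_decodes: int     # requests currently in the decode phase
    pending_prefill_tokens: int  # prompt tokens still waiting to be prefilled


def prefill_token_budget(
    state: SchedulerState,
    chunk_size: int = 2048,      # assumed chunk size, not taken from the PR
    decode_threshold: int = 1,   # assumed cutoff for "decode traffic is low"
) -> int:
    """Return how many prompt tokens to schedule in this step."""
    if state.pending_prefill_tokens <= 0:
        return 0
    if state.num_running_decodes < decode_threshold:
        # Few or no decodes would be delayed, so prefill the prompt in one
        # pass (continuous prefill) and avoid the chunking overhead.
        return state.pending_prefill_tokens
    # Decodes are active: cap the prefill work per step (chunked prefill)
    # so that inter-token latency for running requests stays bounded.
    return min(chunk_size, state.pending_prefill_tokens)


if __name__ == "__main__":
    idle = SchedulerState(num_running_decodes=0, pending_prefill_tokens=10_000)
    busy = SchedulerState(num_running_decodes=4, pending_prefill_tokens=10_000)
    print(prefill_token_budget(idle))  # whole prompt in one step
    print(prefill_token_budget(busy))  # capped at chunk_size
```

A single threshold on the number of running decodes is the simplest possible policy; it is used here only to show where the continuous/chunked decision sits in a scheduling step.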

For more details on the feature, performance results, and implementation, see Optimize Prefill Phase: Add Hybrid Chunked Prefill Support. Would you like a step-by-step breakdown of the technical changes or the benchmark methodology?

Sources:

Optimize Prefill Phase: Add Hybrid Chunked Prefill Support

