Requesting review for PR #26625 (Hybrid Chunked Prefill)

Summary of Changes:
Implements an adaptive scheduling mechanism that switches between continuous and chunked prefill. This avoids the throughput penalty that unconditional chunking incurs when decode traffic is low.

Performance Gains:
Benchmarks on QwQ-32B show a 2-5% increase in total token throughput and reduced Time to First Token (TTFT) in low-concurrency scenarios, while preserving the inter-token latency (ITL) benefits of chunking.

Why I am requesting help:
The PR has been open for over a month. While @hmellor has reviewed the config parts and @njhill started the review, the process has stalled due to reviewer availability (sick leave).

Readiness:

  • Conflicts resolved and rebased onto latest main.

  • DCO/CI checks passed.

  • All previous feedback addressed.

I would appreciate it if a core maintainer (e.g., @WoosukKwon or @simon-mo) could take a brief look to unblock this optimization.

The PR introduces a hybrid chunked prefill mechanism that adaptively switches between continuous and chunked prefill based on decode activity. This removes the throughput penalty of always-on chunking when decode traffic is low, recovering baseline throughput while preserving the inter-token latency (ITL) benefits when decodes are active. Benchmarks on QwQ-32B show a 2–5% increase in total token throughput and 10–20% lower TTFT at low concurrency, with stable scaling up to concurrency=8 and no latency degradation.

All previous feedback has been addressed, conflicts are resolved, and CI/DCO checks have passed. The PR has stalled because of reviewer availability and is ready for a core maintainer to review and unblock merging, as detailed in the PR discussion.
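As a rough illustration of the switching idea described above, the sketch below picks a per-step prefill token budget from the number of active decodes. It is not the PR's implementation: `HybridPrefillPolicy`, `chunk_size`, `decode_threshold`, and `steps_to_finish_prefill` are hypothetical names chosen for this example, and the actual switching criterion in the PR may differ.

```python
from dataclasses import dataclass


@dataclass
class HybridPrefillPolicy:
    # Hypothetical knobs for illustration only; these are not vLLM config names.
    chunk_size: int = 512       # max prefill tokens per step while chunking
    decode_threshold: int = 1   # active decodes at or above this enable chunking

    def prefill_tokens_this_step(self, remaining_prompt_tokens: int,
                                 active_decodes: int) -> int:
        """Return how many prompt tokens to prefill in the current step."""
        if remaining_prompt_tokens <= 0:
            return 0
        if active_decodes < self.decode_threshold:
            # No decode traffic to protect: run the prefill continuously.
            return remaining_prompt_tokens
        # Decodes are active: cap the prefill so decode steps are not starved.
        return min(remaining_prompt_tokens, self.chunk_size)


def steps_to_finish_prefill(policy: HybridPrefillPolicy, prompt_tokens: int,
                            active_decodes: int) -> int:
    """Count scheduler steps needed to prefill a prompt at a fixed decode load."""
    steps = 0
    remaining = prompt_tokens
    while remaining > 0:
        remaining -= policy.prefill_tokens_this_step(remaining, active_decodes)
        steps += 1
    return steps
```

Under these assumptions, a 2000-token prompt is prefilled in a single step when no decodes are running and is split into 512-token chunks once decodes are active, which is the trade-off between TTFT and ITL that the PR targets.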

For more details on the feature, performance results, and implementation, see "Optimize Prefill Phase: Add Hybrid Chunked Prefill Support". Would you like a step-by-step breakdown of the technical changes or the benchmark methodology?
