Hello, I have a question about the max_num_seqs parameter when using Pipeline Parallelism (PP). The documentation (docs/configuration/conserving_memory.md#41) states that max_num_seqs is analogous to maximum batch size.
While examining the code (all references below are to v0.9.0, vllm/engine), I noticed the following:
- In async_llm_engine.py#L853-L855, pipeline_parallel_size virtual engines are created
- Each virtual engine has its own scheduler (llm_engine.py#L342-L347)
- Each scheduler is initialized with max_num_seqs (arg_utils.py#L1114)
Is it correct that the actual maximum batch size when using PP is max_num_seqs * pipeline_parallel_size?
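To make the structure I described above concrete, here is a minimal sketch of what I understand the code to be doing. The Scheduler and build_virtual_engines names here are simplified stand-ins for illustration, not vLLM's actual classes:

```python
# Simplified model of the structure described above: one scheduler per
# virtual engine, each initialized with the same max_num_seqs.
# Illustrative stand-ins only, not vLLM's real implementation.

class Scheduler:
    def __init__(self, max_num_seqs: int):
        self.max_num_seqs = max_num_seqs  # per-scheduler cap
        self.running: list[str] = []      # requests currently scheduled

    def schedule(self, request_id: str) -> bool:
        # Admit a request only while this scheduler is below its own cap.
        if len(self.running) < self.max_num_seqs:
            self.running.append(request_id)
            return True
        return False


def build_virtual_engines(pipeline_parallel_size: int, max_num_seqs: int):
    # One scheduler per virtual engine, mirroring llm_engine.py#L342-L347.
    return [Scheduler(max_num_seqs) for _ in range(pipeline_parallel_size)]


engines = build_virtual_engines(pipeline_parallel_size=4, max_num_seqs=256)
print(len(engines))             # 4 independent schedulers
print(engines[0].max_num_seqs)  # each capped at 256
```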
No, the actual maximum batch size when using Pipeline Parallelism (PP) in vLLM is not max_num_seqs * pipeline_parallel_size. The max_num_seqs parameter sets the maximum number of sequences (i.e., batch size) per iteration for the entire engine, not per pipeline stage or virtual engine. Even though each virtual engine (corresponding to a pipeline stage) has its own scheduler, the scheduler logic ensures that the global batch size across all stages does not exceed max_num_seqs. This is confirmed in both the code and user discussions: “max_num_seqs is per LLMEngine, the number of threads does not matter” and “it refers to the number of sequences in a batch for a single instance” (source, source, source).
Therefore, the maximum batch size is limited by max_num_seqs, not multiplied by pipeline_parallel_size. If you set max_num_seqs=256 and pipeline_parallel_size=4, the maximum number of concurrent sequences processed in a single batch is still 256, not 1024. Would you like more detail on how the schedulers coordinate in PP?
Thanks, I’ve read the answer from RunLLM Bot, but I’d appreciate clarification from the developers, as it appears to contradict my observations:
- In my observations, the Running: %d reqs value logged in metrics.py#L489 consistently reaches max_num_seqs * pipeline_parallel_size during operation
- Reviewing the Scheduler class code (scheduler.py), I was unable to find any coordination between schedulers that would enforce a global max_num_seqs limit under pipeline parallelism
- The referenced issues also don’t specifically address this behavior of max_num_seqs under PP
Could you please clarify:
a) If max_num_seqs is enforced globally, where is the inter-scheduler coordination implemented in the code?
b) Why does the logger show running requests equal to max_num_seqs * pp_size if batch size shouldn’t exceed max_num_seqs?
a) There is no explicit global coordination in the code to enforce max_num_seqs across all pipeline stages when using pipeline parallelism; each virtual engine’s scheduler enforces max_num_seqs independently (source). This means the code does not aggregate or limit the total number of running sequences across all virtual engines to max_num_seqs.
b) As a result, the logger can show Running requests equal to max_num_seqs * pipeline_parallel_size, because each scheduler can schedule up to max_num_seqs sequences independently, so the total running count can reach the product of the two (source). This is known behavior, not a contradiction: it is a consequence of the current implementation.
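As a rough illustration of point b), here is a toy model (not vLLM code) of pipeline_parallel_size independent schedulers each filling up to its own max_num_seqs, which is enough to reproduce the aggregate Running count reported in the log:

```python
# Toy model: each virtual engine's scheduler admits requests up to its own
# max_num_seqs, with no cross-scheduler coordination. The aggregate "Running"
# count can therefore reach max_num_seqs * pipeline_parallel_size.
pipeline_parallel_size = 4
max_num_seqs = 256

running_per_scheduler = []
for _ in range(pipeline_parallel_size):
    running = 0
    # Each scheduler independently admits requests until it hits its own cap.
    while running < max_num_seqs:
        running += 1
    running_per_scheduler.append(running)

total_running = sum(running_per_scheduler)
print(total_running)  # 1024 == 256 * 4, matching the logged "Running" value
```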
Would you like a step-by-step breakdown of how this happens in the code?