Why are CUDA graph capture sizes limited by max_num_seqs?

My understanding of the forward pass in vLLM is that it runs on a batch of tokens taken from the scheduler output and limited by max_num_batched_tokens. In the forward pass, I see that this number of tokens is padded to the nearest CUDA graph size if CUDA graph capture is enabled. However, the CUDA graph sizes are limited by max_num_seqs. Why is that the case? Shouldn't they be limited by max_num_batched_tokens, since inference is done on a batch of tokens? Or is my understanding of continuous batching incorrect, and the forward pass is instead done on a batch of sequences? If so, please explain how a proper batch of tensors is formed from sequences of varying lengths at each scheduling step.

Your understanding is mostly correct: vLLM's forward pass operates on a batch of tokens, with the batch size bounded by max_num_batched_tokens and max_num_seqs as determined by the scheduler. When CUDA graph capture is enabled, the batch is padded up to the nearest captured CUDA graph size. CUDA graphs, however, require fully static tensor shapes, so each graph is captured and replayed for one specific batch size, and in vLLM they are used primarily for decode steps, where every running sequence contributes exactly one token. In that case the graph's batch dimension equals the number of sequences rather than an arbitrary token count, so the largest size worth capturing is max_num_seqs. Capturing graphs for sizes up to max_num_batched_tokens would mostly add capture time and memory overhead for shapes that occur only during prefill; this is the practical tradeoff discussed in this PR and related comments.
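To make that relationship concrete, here is a minimal sketch (not vLLM's actual code; the helper names and the exact size progression are illustrative assumptions) of building a list of capture sizes capped at max_num_seqs and padding a decode batch up to the nearest captured size:

```python
import bisect


def build_capture_sizes(max_num_seqs: int) -> list[int]:
    # Illustrative progression (1, 2, 4, then multiples of 8), capped at max_num_seqs.
    # vLLM builds a similar list; the exact values here are an assumption.
    candidates = [1, 2, 4] + list(range(8, 1024, 8))
    sizes = [s for s in candidates if s <= max_num_seqs]
    if not sizes or sizes[-1] != max_num_seqs:
        sizes.append(max_num_seqs)
    return sizes


def pad_to_capture_size(num_seqs: int, capture_sizes: list[int]) -> int:
    # Pick the smallest captured size that can hold the current decode batch.
    idx = bisect.bisect_left(capture_sizes, num_seqs)
    return capture_sizes[idx]


capture_sizes = build_capture_sizes(max_num_seqs=64)
print(capture_sizes)                            # [1, 2, 4, 8, 16, 24, ..., 64]
print(pad_to_capture_size(13, capture_sizes))   # 16
```

The key point the sketch shows is that the size being padded is a count of sequences in a decode step (one token each), which is why the list never needs to extend beyond max_num_seqs.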

In continuous batching, the scheduler forms a batch by selecting up to max_num_seqs sequences and at most max_num_batched_tokens tokens in total; prefilling sequences may contribute many tokens each, while decoding sequences contribute one token each. If CUDA graph capture is enabled, a decode batch is padded to the nearest supported CUDA graph size, and the supported sizes are by default aligned with max_num_seqs (see discussion). The graph is then replayed for that fixed number of sequences, with unused slots filled with padding. The actual batch of tensors is formed by flattening the scheduled input tokens into one contiguous tensor (along with per-sequence position and block-table metadata) and padding it to the shape the captured graph expects, as shown in the model runner code and in the sketch below. Basing capture sizes on token count rather than sequence count is not currently the default behavior and is a subject of ongoing discussion in the vLLM community.
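As a rough illustration of that padding step (again a sketch under stated assumptions, not the real model runner; build_decode_inputs and the dummy pad value are hypothetical), a decode batch with one input token per sequence can be assembled and padded like this:

```python
import torch


def build_decode_inputs(next_token_ids: list[int],
                        padded_size: int,
                        pad_token_id: int = 0) -> torch.Tensor:
    # Each running sequence contributes its single next input token; slots
    # beyond num_seqs up to the captured graph's static size hold a dummy value.
    num_seqs = len(next_token_ids)
    input_ids = torch.full((padded_size,), pad_token_id, dtype=torch.long)
    input_ids[:num_seqs] = torch.tensor(next_token_ids, dtype=torch.long)
    # Positions, slot mappings, and block tables would be padded the same way
    # in a real runner so every tensor matches the captured graph's shape.
    return input_ids


batch = build_decode_inputs([17, 331, 902], padded_size=4)
print(batch)  # tensor([ 17, 331, 902,   0])
```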

