vLLM V1 - Default max CUDA graph size

Hi, I am doing some testing with vLLM V1. My model hit an OOM during CUDA graph capture.

While debugging, I noticed that vLLM V1 caps the max CUDA graph size based on max_num_batched_tokens, which differs from the V0 behavior of capping it at max_num_seqs.
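
For concreteness, here is a rough sketch of the difference I am describing; the helper name and the numbers are my own illustration, not vLLM's actual internals:

```python
def default_max_capture_size(max_num_seqs: int,
                             max_num_batched_tokens: int,
                             use_v1: bool) -> int:
    """Illustrative only: how the default CUDA graph cap appears to differ.

    V0 capped the largest captured batch size at max_num_seqs (decode-only
    graphs), while V1 appears to cap it at max_num_batched_tokens.
    """
    return max_num_batched_tokens if use_v1 else max_num_seqs

# With max_num_seqs=256 and max_num_batched_tokens=8192, V0 captures graphs
# up to batch size 256, but V1 may capture up to 8192, which needs far more
# memory during capture.
print(default_max_capture_size(256, 8192, use_v1=False))  # 256
print(default_max_capture_size(256, 8192, use_v1=True))   # 8192
```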

My question is whether this is intentional in V1. I know these values can be passed in by the user, but it seems more intuitive for the default max CUDA graph size to be limited by max_num_seqs, since CUDA graphs are typically used to improve TPOT (time per output token).

Thank you!

In V0, we could only capture CUDA graphs for decoding, and the decode batch size is limited by max_num_seqs.

In V1, we can use (piecewise) CUDA graphs for prefill, decode, and mixed prefill/decode batches, so the batch size is limited by max_num_batched_tokens. I think that's the reason.
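
To make the token accounting concrete, here is a small worked example (my own illustration, with made-up numbers, not vLLM code) of why the bound differs between the two cases:

```python
max_num_seqs = 256
max_num_batched_tokens = 8192

# V0: graphs only cover decode-only batches. Each running sequence
# contributes exactly one token per step, so the captured batch size
# never exceeds max_num_seqs.
v0_max_graph_tokens = max_num_seqs

# V1: piecewise graphs also cover prefill and mixed batches. A single step
# may pack prefill chunks plus decode tokens, bounded only by the overall
# token budget.
prefill_tokens = 7936  # e.g. one long prompt chunk
decode_tokens = 256    # one token per running sequence
v1_step_tokens = min(prefill_tokens + decode_tokens, max_num_batched_tokens)

print(v0_max_graph_tokens)  # 256
print(v1_step_tokens)       # 8192 -> graphs must be captured up to this size
```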

It is true that this can lead to a larger CUDA graph memory cost, though.
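
If the extra capture memory is the problem, one possible workaround is to cap the captured sizes explicitly through the compilation config. This is a sketch under the assumption that your vLLM version exposes `cudagraph_capture_sizes` on `CompilationConfig`; the exact field name may vary across versions, so please check before relying on it:

```python
from vllm import LLM
from vllm.config import CompilationConfig

# Sketch: explicitly limit the CUDA graph capture sizes so V1 does not try
# to capture graphs all the way up to max_num_batched_tokens. Assumes
# CompilationConfig exposes cudagraph_capture_sizes (check your version).
llm = LLM(
    model="facebook/opt-125m",
    compilation_config=CompilationConfig(
        cudagraph_capture_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256],
    ),
)
```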

Are you referring to V1's support for CUDA graphs during the prefill phase? Could you please share some insight into its implementation?