vLLM V1 - Default max CUDA graph size

Hi, I am doing some testing with vLLM V1. My model ran into an OOM during CUDA graph capture.

While debugging, I noticed that in vLLM V1 the maximum CUDA graph size is capped based on max_num_batched_tokens, which differs from the V0 behavior of capping it at max_num_seqs.

My question is whether this is intentional in V1. I know these values can be passed in by the user, but it seems more intuitive for the default max CUDA graph size to be limited by max_num_seqs, since CUDA graphs are typically used to improve TPOT.
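
For reference, my setup looks roughly like this (the model name and values are placeholders, not my exact configuration):

```python
from vllm import LLM

# With V1, the default CUDA graph capture sizes are derived from
# max_num_batched_tokens, so a large value here can make graph capture
# OOM even though decode batches never exceed max_num_seqs.
llm = LLM(
    model="my-model",             # placeholder
    max_num_seqs=256,             # decode batch size cap
    max_num_batched_tokens=8192,  # much larger than max_num_seqs
)
```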

Thank you!

In V0, we can only capture CUDA graphs for decoding, and the decode batch size is limited by max-num-seqs.

In V1, we can use (piecewise) CUDA graphs for prefill, decode, and mixed prefill/decode batches, so the batch size is limited by max-num-batched-tokens. I think that's the reason.

It is true that this can lead to a larger CUDA graph memory cost, though.
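
If the capture memory is a problem, one possible workaround (a sketch; I'm going from memory on the compilation_config field names, so please double-check against the docs) is to pin the capture sizes explicitly instead of relying on the default derived from max-num-batched-tokens:

```python
from vllm import LLM

# Restrict which batch sizes get CUDA graphs captured, which bounds the
# capture memory. Values below are illustrative.
llm = LLM(
    model="my-model",  # placeholder
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]},
)
```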
