vLLM V1 - Default max CUDA graph size

Hi, I am doing some testing with vLLM V1. My model ran into an OOM during CUDA graph capture.

While debugging, I noticed that in vLLM V1 the maximum CUDA graph size is capped based on max_num_batched_tokens, which differs from the V0 behavior of capping it at max_num_seqs.

My question is whether this is intentional in V1. I know these values can be passed in by the user, but it seems more intuitive for the default max CUDA graph size to be limited by max_num_seqs, since CUDA graphs are typically used to improve TPOT.
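
For reference, my setup looks roughly like this (the model name and values are placeholders, not my exact configuration):

```python
from vllm import LLM

# With V1, the default CUDA graph capture sizes are derived from
# max_num_batched_tokens, so a large value here can make graph capture
# OOM even though decode batches never exceed max_num_seqs.
llm = LLM(
    model="my-model",             # placeholder
    max_num_seqs=256,             # decode batch size cap
    max_num_batched_tokens=8192,  # much larger than max_num_seqs
)
```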

Thank you!

In V0, we can only capture CUDA graphs for decoding, and the decode batch size is limited by max-num-seqs.

In V1, we can use (piecewise) CUDA graphs for prefill, decode, and mixed prefill/decode batches, so the batch size is limited by max-num-batched-tokens. I think that's the reason.

It is true that this can lead to a larger CUDA graph memory cost, though.
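
If the capture memory is a problem, one possible workaround (a sketch; I'm going from memory on the compilation_config field names, so please double-check against the docs) is to pin the capture sizes explicitly instead of relying on the default derived from max-num-batched-tokens:

```python
from vllm import LLM

# Restrict which batch sizes get CUDA graphs captured, which bounds the
# capture memory. Values below are illustrative.
llm = LLM(
    model="my-model",  # placeholder
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]},
)
```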
