How does CUDA graph memory scale in vLLM?

@RunLLM Thank you for the detailed validation and the fantastic insights! It is incredibly encouraging to hear that the core math aligns with vLLM’s internal profiling.

Yes, I would absolutely love a step-by-step breakdown of vLLM’s memory profiling and edge cases. I am particularly interested in understanding how the CUDA graph buffers and activation memory scale: is it a flat rate, or does it scale linearly with the max tokens/batch size?

Right now, vAquila adds a flat ~1.5–2 GB overhead to the calculation to absorb the PyTorch context and other non-torch allocations. If I can replace this flat rate with the actual formula for CUDA graphs, the tool will be much more precise.
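For reference, a flat-overhead estimate of this kind can be sketched as follows. The function and constant names are hypothetical, not vAquila's actual code:

```python
# Hypothetical sketch of a flat-overhead VRAM estimate (illustrative
# names, not vAquila's actual implementation).

GiB = 1024 ** 3

def estimate_vram_bytes(weight_bytes: int, kv_cache_bytes: int,
                        flat_overhead_bytes: int = 2 * GiB) -> int:
    """Weights + KV cache + a flat buffer intended to absorb the
    CUDA context, CUDA graphs, and other non-torch allocations."""
    return weight_bytes + kv_cache_bytes + flat_overhead_bytes

# e.g. a 7B model in fp16 (~14 GiB of weights) with a 4 GiB KV cache:
total = estimate_vram_bytes(14 * GiB, 4 * GiB)
print(total / GiB)  # 20.0
```

The weakness of this approach is exactly what the question above gets at: the CUDA graph share of the flat buffer is not actually constant.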

Also, thank you for the tip regarding the VLLM_CPU_KVCACHE_SPACE environment variable! I will explicitly integrate this into the Docker runtime arguments for CPU deployments in the next Beta update to make RAM allocation much safer.
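A sizing helper for that integration could look like the sketch below. The heuristic (subtracting weights plus a fixed headroom from total RAM) is illustrative, not vLLM's; only the environment variable name comes from the thread above:

```python
# Sketch: derive a conservative VLLM_CPU_KVCACHE_SPACE value (in GiB)
# from total system RAM. The sizing heuristic is illustrative only.

def cpu_kvcache_space_gib(total_ram_gib: float,
                          weights_gib: float,
                          headroom_gib: float = 4.0) -> int:
    """Whatever RAM remains after model weights and a fixed headroom,
    floored to an integer GiB (minimum 1)."""
    free = total_ram_gib - weights_gib - headroom_gib
    return max(1, int(free))

def docker_env_args(space_gib: int) -> list[str]:
    # Passed straight through to `docker run` as an env flag.
    return ["-e", f"VLLM_CPU_KVCACHE_SPACE={space_gib}"]

print(docker_env_args(cpu_kvcache_space_gib(64.0, 16.0)))
# ['-e', 'VLLM_CPU_KVCACHE_SPACE=44']
```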

Looking forward to the breakdown!

Here’s a step-by-step breakdown of vLLM’s memory profiling:

  1. Model Weights: vLLM loads model weights first and measures their memory usage directly (see code). This is a fixed cost based on model size and dtype.
  2. Activation Memory: During a profiling run, vLLM runs a dummy forward pass with the configured max_model_len and max_num_seqs to empirically measure peak activation memory. This activation memory scales with both batch size and sequence length, but the scaling is not strictly linear due to kernel and backend differences (see docs).
  3. CUDA Graph Buffers: If CUDA graphs are enabled (the default), vLLM captures graphs for various batch sizes. The memory for CUDA graphs is reserved up front and can be significant—often hundreds of MiB to several GiB depending on the number of capture sizes and model architecture. The scaling is stepwise: each captured graph for a batch size reserves a buffer, so more capture sizes or larger batch sizes increase total reserved memory (see code).
  4. Non-Torch Allocations: vLLM also accounts for memory used by NCCL, CUDA context, and other non-torch allocations, which are measured empirically during profiling.
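Putting the four components together, the accounting reduces to a simple sum. The numbers below are placeholders: in practice vLLM measures the activation, CUDA graph, and non-torch components empirically during the profiling run rather than computing them from a formula.

```python
# Toy model of vLLM's memory accounting. All component values are
# placeholders; vLLM measures components 2-4 during profiling.

GiB = 1024 ** 3

profile = {
    "model_weights": 14 * GiB,      # fixed: model size x dtype
    "activations": int(1.2 * GiB),  # peak from the dummy forward pass
    "cuda_graphs": int(0.8 * GiB),  # sum over captured batch sizes
    "non_torch": int(0.5 * GiB),    # NCCL, CUDA context, etc.
}

reserved = sum(profile.values())

# The KV cache then receives whatever remains of the GPU budget:
gpu_memory_utilization = 0.9
gpu_total = 24 * GiB
kv_cache = int(gpu_total * gpu_memory_utilization) - reserved
print(round(kv_cache / GiB, 2))  # ~5.1 GiB left for KV cache
```

This also shows why underestimating any one component directly shrinks (or overflows) the KV cache allocation.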

Key edge cases:

  • CUDA graph memory is not a flat rate; it increases with the number and size of captured graphs (i.e., more batch sizes or larger max batch size = more memory).
  • Activation memory can spike for large batch sizes or long sequences, especially if logprobs or multi-modal features are enabled.
  • Fragmentation and non-torch allocations can cause actual usage to exceed estimates, so a buffer is still recommended.
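To make the stepwise scaling in the first bullet concrete, here is a toy model: each captured batch size reserves its own buffer, so total graph memory grows with both the number and the size of the capture points. The per-size cost function is made up for illustration; the real cost depends on model architecture and kernels.

```python
# Toy model of stepwise CUDA graph memory: one buffer per captured
# batch size. The cost function below is illustrative only.

MiB = 1024 ** 2

def graph_buffer_bytes(batch_size: int,
                       fixed_mib: int = 20,
                       per_seq_mib: int = 2) -> int:
    # Hypothetical: fixed per-graph overhead plus a per-sequence buffer.
    return (fixed_mib + per_seq_mib * batch_size) * MiB

# vLLM-style capture sizes: 1, 2, 4, then multiples of 8 up to the max.
capture_sizes = [1, 2, 4] + list(range(8, 65, 8))

total = sum(graph_buffer_bytes(bs) for bs in capture_sizes)
print(len(capture_sizes), round(total / MiB))  # 11 sizes, 810 MiB
```

Note the two levers this exposes: trimming the list of capture sizes (fewer steps) and lowering the maximum captured batch size (smaller steps) both reduce the reserved total.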

Would you like a code example or more detail on how to empirically measure these values?
