Max_model_len vs GPU memory Usage

Ashhar-armada · June 22, 2026, 5:49am

It is commonly believed that reducing max_model_len reduces GPU memory usage.

However, I did not observe this when deploying the model meta-llama-3.2-3b (https://build.nvidia.com/meta/llama-3.2-3b-instruct?nim=self-hosted) from the NIM model repo on a Tesla T4 GPU (16 GB VRAM).

I deployed the model with several different values of max_model_len, but the GPU memory usage remained constant.

Here are the logs:
Logs for max_model_len: 8192

the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.90) = 13.11GiB
model weights take 6.02GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.22GiB; the rest of the memory reserved for KV Cache is 5.83GiB.
INFO 2026-06-22 05:04:45.292 executor_base.py:111] # cuda blocks: 3409, # CPU blocks: 2340
INFO 2026-06-22 05:04:45.292 executor_base.py:116] Maximum concurrency for 8192 tokens per request: 6.66x

Logs for max_model_len: 4096

the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.90) = 13.11GiB
model weights take 6.02GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 5.85GiB.
INFO 2026-06-22 04:43:29.435 executor_base.py:111] # cuda blocks: 3422, # CPU blocks: 2340
INFO 2026-06-22 04:43:29.435 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 13.37x

Logs for max_model_len: 16384

the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.90) = 13.11GiB
model weights take 6.02GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.26GiB; the rest of the memory reserved for KV Cache is 5.78GiB.
INFO 2026-06-22 05:29:37.546 executor_base.py:111] # cuda blocks: 3381, # CPU blocks: 2340
INFO 2026-06-22 05:29:37.546 executor_base.py:116] Maximum concurrency for 16384 tokens per request: 3.30x
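As a cross-check on the three logs above, the reported "Maximum concurrency" figures can be reproduced from the "# cuda blocks" counts. This is only a sketch: it assumes each KV-cache block holds 16 tokens (vLLM's usual default block size, not stated in these logs), and `max_concurrency` is a hypothetical helper name, not a vLLM function.

```python
# Values copied from the log excerpts above: max_model_len -> "# cuda blocks".
RUNS = {8192: 3409, 4096: 3422, 16384: 3381}

# Assumption: 16 tokens per KV-cache block (not shown in the logs).
BLOCK_SIZE = 16


def max_concurrency(num_gpu_blocks: int, max_model_len: int,
                    block_size: int = BLOCK_SIZE) -> float:
    """How many full-length requests fit in the KV cache at once."""
    total_cache_tokens = num_gpu_blocks * block_size
    return total_cache_tokens / max_model_len


for max_model_len, blocks in RUNS.items():
    print(f"max_model_len={max_model_len}: "
          f"{max_concurrency(blocks, max_model_len):.2f}x")
```

Under that assumption the results agree with the logged 6.66x, 13.37x, and 3.30x: the block count (and so the memory reserved) barely moves, while the number of full-length requests that fit changes in proportion to 1/max_model_len.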

This contradicts the expectation that lowering the context length saves GPU memory.

Can you explain the reason for this behavior?

RunLLM · June 22, 2026, 5:49am

Reducing max_model_len in vLLM does not, by itself, reduce total GPU memory usage, because vLLM pre-allocates GPU memory according to the formula: (total memory * gpu_memory_utilization) - weights - maximum_activation_size. The memory reserved for the KV cache is whatever remains after accounting for the model weights and the peak activation memory measured during a profiling run with your chosen max_model_len. For small to moderate changes in max_model_len, the difference in activation memory is minor (1.19 GiB to 1.26 GiB in your logs), so the overall GPU memory usage stays nearly constant. What changes is the number of full-length requests that fit in the cache (the "Maximum concurrency" value), not the total memory footprint. The footprint itself is capped by gpu_memory_utilization, so lowering that value is what shrinks it. This is expected vLLM behavior and not a bug, as confirmed in the discussion linked under Sources.
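The formula can be checked against the numbers in the logs with a few lines of Python. This is a rough sketch, not vLLM's actual code: `kv_cache_budget_gib` is a hypothetical helper, and it also subtracts the small non_torch_memory term that the log line reports. Because the logged inputs are rounded to two decimals, the results only match the logged KV cache sizes to within about 0.02 GiB.

```python
def kv_cache_budget_gib(total_gib: float, gpu_memory_utilization: float,
                        weights_gib: float, non_torch_gib: float,
                        activation_peak_gib: float) -> float:
    """Memory left for the KV cache after weights and profiling overheads."""
    usable_gib = total_gib * gpu_memory_utilization
    return usable_gib - weights_gib - non_torch_gib - activation_peak_gib


# max_model_len -> (activation peak from the logs, logged KV cache size), in GiB.
LOGGED = {4096: (1.19, 5.85), 8192: (1.22, 5.83), 16384: (1.26, 5.78)}

budgets = {}
for max_model_len, (activation_peak, logged_kv) in LOGGED.items():
    budgets[max_model_len] = kv_cache_budget_gib(
        total_gib=14.56, gpu_memory_utilization=0.90,
        weights_gib=6.02, non_torch_gib=0.05,
        activation_peak_gib=activation_peak)
    print(f"max_model_len={max_model_len}: computed "
          f"{budgets[max_model_len]:.2f} GiB, logged {logged_kv:.2f} GiB")
```

In every case the three terms plus the KV cache add back up to the same usable budget (about 13.11 GiB), which is why tools such as nvidia-smi show essentially the same usage for all three settings.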

Would you like a step-by-step breakdown of how vLLM calculates and allocates memory for different max_model_len values?

Sources:

https://github.com/vllm-project/vllm/issues/5416


Ashhar-armada · June 22, 2026, 5:57am

@RunLLM Yes, please help me understand why lowering max_model_len doesn't reduce GPU usage, and why it is so often recommended as a way to reduce GPU usage. Is that advice tied to a particular version of vLLM?
