max_model_len vs GPU memory usage

It is commonly believed that reducing max_model_len reduces GPU memory usage.

However, I did not observe this when deploying the model meta-llama-3.2-3b https://build.nvidia.com/meta/llama-3.2-3b-instruct?nim=self-hosted from the NIM model repo on a Tesla T4 GPU (16GB VRAM).

I deployed the model with several different values of max_model_len, but the GPU memory usage remained constant.

Here are the logs:

Logs for max_model_len: 8192

the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.90) = 13.11GiB
model weights take 6.02GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.22GiB; the rest of the memory reserved for KV Cache is 5.83GiB.
INFO 2026-06-22 05:04:45.292 executor_base.py:111] # cuda blocks: 3409, # CPU blocks: 2340
INFO 2026-06-22 05:04:45.292 executor_base.py:116] Maximum concurrency for 8192 tokens per request: 6.66x

Logs for max_model_len: 4096

the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.90) = 13.11GiB
model weights take 6.02GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 5.85GiB.
INFO 2026-06-22 04:43:29.435 executor_base.py:111] # cuda blocks: 3422, # CPU blocks: 2340
INFO 2026-06-22 04:43:29.435 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 13.37x

Logs for max_model_len: 16384

the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.90) = 13.11GiB
model weights take 6.02GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.26GiB; the rest of the memory reserved for KV Cache is 5.78GiB.
INFO 2026-06-22 05:29:37.546 executor_base.py:111] # cuda blocks: 3381, # CPU blocks: 2340
INFO 2026-06-22 05:29:37.546 executor_base.py:116] Maximum concurrency for 16384 tokens per request: 3.30x

This contradicts the expectation that lowering the context length saves some GPU memory.

Can you explain the reason for this behavior?

Reducing max_model_len in vLLM does not always reduce total GPU memory usage, because vLLM pre-allocates GPU memory up to a fixed budget of total memory * gpu_memory_utilization (13.11GiB in your logs). The KV cache receives whatever remains of that budget: (total memory * gpu_memory_utilization) - weights - maximum_activation_size, where the peak activation memory is measured during a profiling run with your chosen max_model_len (your logs also subtract a small non_torch_memory term). For small to moderate changes in max_model_len, the difference in activation memory is minor (1.19GiB to 1.26GiB across your three runs), so the KV cache size and the overall GPU memory usage remain nearly constant. Only the number of concurrent requests (concurrency) changes, not the total memory footprint. This is expected vLLM behavior and not a bug, as confirmed in this discussion.
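To make the arithmetic concrete, here is a minimal Python sketch that replays the numbers from the logs in the question. It is not vLLM code: the function names are hypothetical, the 13.11GiB budget is taken directly from the log line rather than recomputed, and the block size of 16 tokens per KV cache block is an assumption (it is not printed in the logs, although the reported concurrency values are consistent with it).

```python
# Illustrative only: replays the memory accounting shown in the logs.
# All figures are in GiB and copied from the question.

BUDGET_GIB = 13.11       # total_gpu_memory (14.56) x gpu_memory_utilization (0.90), as logged
WEIGHTS_GIB = 6.02       # "model weights take 6.02GiB"
NON_TORCH_GIB = 0.05     # "non_torch_memory takes 0.05GiB"
BLOCK_SIZE_TOKENS = 16   # ASSUMPTION: tokens per KV cache block, not shown in the logs

# max_model_len -> values logged for that run
RUNS = {
    4096: {"activation_gib": 1.19, "cuda_blocks": 3422},
    8192: {"activation_gib": 1.22, "cuda_blocks": 3409},
    16384: {"activation_gib": 1.26, "cuda_blocks": 3381},
}


def kv_cache_gib(activation_gib: float) -> float:
    """Memory left for the KV cache after weights and peak activations."""
    return BUDGET_GIB - WEIGHTS_GIB - NON_TORCH_GIB - activation_gib


def max_concurrency(cuda_blocks: int, max_model_len: int,
                    block_size: int = BLOCK_SIZE_TOKENS) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return cuda_blocks * block_size / max_model_len


if __name__ == "__main__":
    for max_model_len, run in RUNS.items():
        kv = kv_cache_gib(run["activation_gib"])
        conc = max_concurrency(run["cuda_blocks"], max_model_len)
        print(f"max_model_len={max_model_len}: "
              f"KV cache ~{kv:.2f} GiB, concurrency ~{conc:.2f}x, "
              f"reserved budget {BUDGET_GIB:.2f} GiB")
```

In every run the reserved budget is the same 13.11GiB. Only the split between activations and KV cache shifts by a few hundredths of a GiB, while the concurrency figure roughly halves each time max_model_len doubles.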

Would you like a step-by-step breakdown of how vLLM calculates and allocates memory for different max_model_len values?

@RunLLM yes, please help me understand why lowering max_model_len doesn't reduce GPU usage. Also, why is it commonly stated that lowering it will reduce GPU usage? Is that related to a particular version of vLLM?