I want to deploy an LLM and I want to know exactly how many GPUs and how much VRAM should be assigned based on max context length, mean tokens, and batch size. Is there any tool that provides a sizing report for a model by adjusting the model parameters?
vLLM provides a Colab-based calculator that estimates the required VRAM and number of chips/GPUs for a given model, context length, and batch size. It reports the KV cache size per token and per request, the model weight memory, the allocated KV cache memory, and the approximate maximum number of concurrent requests (--max-num-seqs) you can set. Adjusting the parameters in the calculator gives you a sizing report for your deployment; the calculator is linked in the official documentation under TPU Optimization Tips.
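If you want a quick back-of-envelope check outside the calculator, the same quantities can be estimated directly. The sketch below is a rough estimate only: the model figures (Llama-3-8B-style, fp16) and the single 80 GB GPU are illustrative assumptions, and it ignores activation memory and fragmentation.

```python
# Back-of-envelope sizing, mirroring the quantities the calculator reports.
# All model/hardware numbers below are illustrative assumptions.

GiB = 1024**3

# --- model assumptions (Llama-3-8B-style) ---
num_params      = 8e9    # total weights
bytes_per_param = 2      # fp16/bf16 weights
num_layers      = 32
num_kv_heads    = 8      # grouped-query attention
head_dim        = 128
kv_bytes        = 2      # fp16 KV cache

# --- deployment assumptions ---
gpu_mem_gib            = 80      # e.g. one 80 GB GPU
gpu_memory_utilization = 0.9     # vLLM default
max_model_len          = 8192    # max context length per request

# 1. Model weight memory
weight_mem = num_params * bytes_per_param

# 2. KV cache size per token: K and V, per layer, per KV head
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_bytes

# 3. KV cache per request at full context length
kv_per_request = kv_per_token * max_model_len

# 4. Memory left for the KV cache within the utilization budget
kv_budget = gpu_mem_gib * GiB * gpu_memory_utilization - weight_mem

# 5. Approximate number of concurrent full-length requests
#    (a rough upper bound for --max-num-seqs)
max_seqs = kv_budget // kv_per_request

print(f"weights:       {weight_mem / GiB:.1f} GiB")
print(f"KV per token:  {kv_per_token / 1024:.0f} KiB")
print(f"KV per request:{kv_per_request / GiB:.2f} GiB at {max_model_len} tokens")
print(f"KV budget:     {kv_budget / GiB:.1f} GiB")
print(f"~max-num-seqs: {int(max_seqs)} full-length requests")
```

With these assumed numbers, the weights take about 15 GiB, each full-length request needs about 1 GiB of KV cache, and roughly 57 full-length requests fit in the remaining budget; the calculator performs the same kind of accounting with the actual model configuration.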
Would you like more detail on how to use the calculator or interpret its results?
If I set the max number of batches and the max context length in the arguments when deploying the vLLM pod, will it reserve the maximum required VRAM from the start based on those settings, or will VRAM be allocated dynamically as it is needed?
The gpu-memory-utilization is 0.9 by default. Does this cover only the KV cache and weights, or activations as well?
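For reference, these are the settings in question, expressed here via the offline vllm.LLM API as a minimal sketch; the model name and values are placeholders, and the same options exist as command-line flags on vllm serve.

```python
from vllm import LLM

# Placeholder model and values, only to show which knobs the questions refer to.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed example model
    max_model_len=8192,           # max context length per request (--max-model-len)
    max_num_seqs=64,              # max concurrent sequences in a batch (--max-num-seqs)
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may use (--gpu-memory-utilization)
)
```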