I want to deploy an LLM and I want to know exactly how many GPUs and how much VRAM should be assigned based on max context length, mean tokens, and batch size. Is there any tool that provides a sizing report for a model by adjusting the model parameters?
vLLM provides a Colab-based calculator that estimates the required VRAM and number of chips/GPUs for a given model, context length, and batch size. It reports the KV cache size per token and per request, the model weight memory, the allocated KV cache memory, and the approximate maximum number of concurrent requests (--max-num-seqs) you can set. Adjusting the parameters in the calculator gives you a sizing report for your deployment; the calculator is linked in the official documentation under TPU Optimization Tips.
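If you want a quick back-of-envelope check outside the calculator, the same quantities can be estimated directly. The sketch below is a rough estimate only: the model figures (Llama-3-8B-style, fp16) and the single 80 GB GPU are illustrative assumptions, and it ignores activation memory and fragmentation.

```python
# Back-of-envelope sizing, mirroring the quantities the calculator reports.
# All model/hardware numbers below are illustrative assumptions.

GiB = 1024**3

# --- model assumptions (Llama-3-8B-style) ---
num_params      = 8e9    # total weights
bytes_per_param = 2      # fp16/bf16 weights
num_layers      = 32
num_kv_heads    = 8      # grouped-query attention
head_dim        = 128
kv_bytes        = 2      # fp16 KV cache

# --- deployment assumptions ---
gpu_mem_gib            = 80      # e.g. one 80 GB GPU
gpu_memory_utilization = 0.9     # vLLM default
max_model_len          = 8192    # max context length per request

# 1. Model weight memory
weight_mem = num_params * bytes_per_param

# 2. KV cache size per token: K and V, per layer, per KV head
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_bytes

# 3. KV cache per request at full context length
kv_per_request = kv_per_token * max_model_len

# 4. Memory left for the KV cache within the utilization budget
kv_budget = gpu_mem_gib * GiB * gpu_memory_utilization - weight_mem

# 5. Approximate number of concurrent full-length requests
#    (a rough upper bound for --max-num-seqs)
max_seqs = kv_budget // kv_per_request

print(f"weights:       {weight_mem / GiB:.1f} GiB")
print(f"KV per token:  {kv_per_token / 1024:.0f} KiB")
print(f"KV per request:{kv_per_request / GiB:.2f} GiB at {max_model_len} tokens")
print(f"KV budget:     {kv_budget / GiB:.1f} GiB")
print(f"~max-num-seqs: {int(max_seqs)} full-length requests")
```

With these assumed numbers, the weights take about 15 GiB, each full-length request needs about 1 GiB of KV cache, and roughly 57 full-length requests fit in the remaining budget; the calculator performs the same kind of accounting with the actual model configuration.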
Would you like more detail on how to use the calculator or interpret its results?
If I set the max number of batches and the max context length in the arguments when deploying the vLLM pod, will it reserve the maximum required VRAM from the start based on those settings, or will VRAM be allocated dynamically as it is needed?
The gpu-memory-utilization is 0.9 by default. Does this cover only the KV cache and weights, or activations as well?
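For reference, these are the settings in question, expressed here via the offline vllm.LLM API as a minimal sketch; the model name and values are placeholders, and the same options exist as command-line flags on vllm serve.

```python
from vllm import LLM

# Placeholder model and values, only to show which knobs the questions refer to.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed example model
    max_model_len=8192,           # max context length per request (--max-model-len)
    max_num_seqs=64,              # max concurrent sequences in a batch (--max-num-seqs)
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may use (--gpu-memory-utilization)
)
```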