I know model weights and the KV cache are included. Is anything else included?
One more question: if the KV cache grows too large, what happens? For example, suppose 100 GB of GPU memory is set aside for weights and the KV cache, but at some point the KV cache needs more than that 100 GB. What happens then?
The gpu_memory_utilization parameter in vLLM controls the fraction of total GPU memory reserved for model weights, the KV cache, activation memory (intermediate tensors), and system overhead (e.g., the CUDA context, PyTorch, NCCL, CUDA graphs). It is not just model weights and KV cache; activation buffers and non-torch allocations also count against this budget. See the docs and API reference.
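As a concrete illustration, here is a minimal sketch of setting the parameter through the offline `LLM` entry point (the model name is just an example; `gpu_memory_utilization` defaults to 0.9 as far as I know):

```python
from vllm import LLM

# Reserve 90% of each GPU's memory for the vLLM engine. Within that
# budget, vLLM loads the weights, profiles peak activation memory with
# a dummy forward pass, and hands whatever remains to the KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    gpu_memory_utilization=0.9,                # fraction of total GPU memory
)
```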
If the total demand exceeds what the GPU actually has (e.g., you set gpu_memory_utilization so high that the sum of weights, activations, and KV cache needs more memory than is available), you will get a CUDA out-of-memory (OOM) error and the process will fail. vLLM does not automatically spill excess KV cache to CPU memory unless you explicitly configure offloading; otherwise, OOM occurs. See the troubleshooting guide and related issue discussions.
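If you do hit OOM, the usual mitigations are lowering gpu_memory_utilization, capping the context length to bound per-request KV cache, or reserving CPU swap space. A hedged sketch follows; all values are illustrative, not recommendations, and to my understanding `swap_space` provides CPU swap for preempted sequences rather than transparently spilling the cache:

```python
from vllm import LLM

# Illustrative tuning only; the right values depend on your model and GPU.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.85,  # leave headroom for non-torch allocations
    max_model_len=8192,           # cap context length to bound KV cache per request
    swap_space=8,                 # GiB of CPU swap per GPU for preempted sequences
)
```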
Would you like more detail on memory profiling or how to tune these parameters?