Run multiple models

RunLLM · July 25, 2025, 11:58pm

Yes. When you run multiple vLLM servers (each serving a different model) on the same GPU(s), with or without tensor parallelism, set --gpu-memory-utilization for each instance so that the values add up to no more than 1.0 (100%). For example, with two servers you might set --gpu-memory-utilization=0.5 for each. The parameter is a per-instance limit: each server claims only its own fraction of every GPU it runs on and does not account for other instances sharing that GPU. If the fractions sum to more than 1.0, the instance that starts later is likely to fail with an out-of-memory error. In practice it is safer to keep the total somewhat below 1.0, since other processes on the GPU also need memory.
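As a rough illustration of the arithmetic, here is a minimal Python sketch that splits the memory budget evenly and builds one `vllm serve` command per model. It only prints the commands and does not launch anything. The model names, the 10% headroom, the base port, and the helper name `build_serve_commands` are assumptions made for the example, not values from the vLLM documentation; adjust them to your setup.

```python
import math
import shlex


def build_serve_commands(models, tensor_parallel_size=1, headroom=0.1, base_port=8000):
    """Return (fraction, commands) for vLLM instances sharing the same GPU(s).

    headroom is an assumed safety margin left unallocated on each GPU.
    """
    if not models:
        raise ValueError("need at least one model")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0.0, 1.0)")

    # Even split of the remaining budget, rounded DOWN to two decimals
    # so the per-instance fractions can never sum past the budget.
    fraction = math.floor((1.0 - headroom) / len(models) * 100) / 100

    commands = []
    for i, model in enumerate(models):
        commands.append([
            "vllm", "serve", model,
            "--port", str(base_port + i),  # one port per server
            "--tensor-parallel-size", str(tensor_parallel_size),
            "--gpu-memory-utilization", str(fraction),
        ])
    return fraction, commands


if __name__ == "__main__":
    # Hypothetical model names, used only as placeholders.
    models = ["org/model-a", "org/model-b"]
    fraction, commands = build_serve_commands(models, tensor_parallel_size=2)
    for cmd in commands:
        print(shlex.join(cmd))
```

An even split is only a starting point. If one model is much larger than the other, you would likely give it a larger fraction, as long as the total stays at or below 1.0.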

Would you like more detail on how to calculate or set these values in practice?
