Run multiple models

Yes. When you run multiple vLLM servers (each serving a different model) on the same GPU(s), with or without tensor parallelism, set the --gpu-memory-utilization parameter for each instance so that the values sum to no more than 1.0 (100% of GPU memory). For example, if you run two servers, you might set --gpu-memory-utilization=0.5 for each. The setting is a per-instance limit: each server budgets only its own fraction of GPU memory and does not account for other instances on the same device, so keeping the total at or below 1.0 is what prevents out-of-memory errors. This per-instance behavior is described in the vLLM documentation and related issues.
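As a rough illustration of the arithmetic, here is a minimal Python sketch that splits the GPU memory budget evenly across servers and prints one launch command per model. The model names, port numbers, the even split, and the `headroom` reserve are assumptions made for this example, not vLLM defaults; only `vllm serve`, `--gpu-memory-utilization`, `--tensor-parallel-size`, and `--port` are taken from the vLLM CLI.

```python
import math
import shlex


def plan_servers(models, tensor_parallel_size=1, headroom=0.1, base_port=8000):
    """Return (share, commands) for co-locating one vLLM server per model.

    `headroom` is a hypothetical safety margin left unallocated; it is an
    assumption of this sketch, not a vLLM setting.
    """
    if not models:
        raise ValueError("at least one model is required")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")

    # Even split of the remaining budget, rounded down to two decimals
    # so the shares can never add up to more than 1.0.
    share = math.floor((1.0 - headroom) / len(models) * 100) / 100
    if share <= 0:
        raise ValueError("too many models for the available memory budget")

    commands = []
    for index, model in enumerate(models):
        args = [
            "vllm", "serve", model,
            "--gpu-memory-utilization", f"{share:.2f}",
            "--tensor-parallel-size", str(tensor_parallel_size),
            "--port", str(base_port + index),  # one port per server
        ]
        commands.append(shlex.join(args))
    return share, commands


if __name__ == "__main__":
    # "model-a" and "model-b" are placeholder model names.
    share, commands = plan_servers(["model-a", "model-b"], tensor_parallel_size=2)
    for command in commands:
        print(command)
```

An even split is only a starting point: a larger model needs a larger share for its weights and KV cache, so in practice the fractions are usually tuned per model while keeping their sum at or below 1.0.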

Would you like more detail on how to calculate or set these values in practice?
