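Yes. When several vLLM servers (each serving a different model) share the same GPU(s), set --gpu-memory-utilization on each instance so that the fractions sum to no more than 1.0 (100%). The value is a per-instance limit: each server budgets that fraction of total GPU memory for its own model executor (weights, activations, and KV cache) and does not account for other vLLM instances on the same device. For example, with two servers you might set --gpu-memory-utilization=0.5 for each; leaving both at the default of 0.9 would oversubscribe the GPU. A slightly lower value per instance can leave margin for memory used by CUDA contexts or other processes. With tensor parallelism, the same fraction applies on every GPU in the instance's tensor-parallel group. Splitting memory this way keeps each server within its share and helps prevent out-of-memory errors, with or without tensor parallelism, which is consistent with the documentation and issues listed under Sources.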
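As a rough illustration of how the per-instance values could be worked out, here is a minimal Python sketch. It splits the usable fraction among instances in proportion to integer weights and prints matching `vllm serve` commands. The model names, the 10% headroom, the relative weights, and the helper names `split_gpu_memory` and `build_commands` are all assumptions made for this example; they are not vLLM defaults or part of its API.

```python
import shlex


def split_gpu_memory(weights, headroom_pct=10):
    """Split (100 - headroom_pct)% of GPU memory in proportion to integer weights.

    Works in whole percent and rounds down, so the returned fractions
    never sum to more than the usable share.
    """
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be positive integers")
    usable_pct = 100 - headroom_pct
    total = sum(weights)
    return [(usable_pct * w // total) / 100 for w in weights]


def build_commands(models, tensor_parallel_size, base_port=8000, headroom_pct=10):
    """Build one `vllm serve` command (as an argument list) per (model, weight) pair."""
    fractions = split_gpu_memory([w for _, w in models], headroom_pct)
    commands = []
    for i, ((name, _), frac) in enumerate(zip(models, fractions)):
        commands.append([
            "vllm", "serve", name,
            "--tensor-parallel-size", str(tensor_parallel_size),
            "--gpu-memory-utilization", f"{frac:.2f}",
            "--port", str(base_port + i),  # each server needs its own port
        ])
    return commands


# Hypothetical model names and relative weights, for illustration only.
example_models = [("org/model-a", 1), ("org/model-b", 1)]
for cmd in build_commands(example_models, tensor_parallel_size=2):
    print(shlex.join(cmd))
```

Equal weights give each instance the same share. If one model is much larger than the other, an uneven split (for example weights of 3 and 1) may be a better starting point, adjusted after checking actual memory use at startup.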
Would you like more detail on how to calculate or set these values in practice?
Sources:
- vLLM Engine Args: --gpu-memory-utilization
- GitHub Issue: Engine V1 When loading two models into the same GPU the second model requires more memory allocation than the first
- GitHub Issue: Have you considered sharing KV cache of a single GPU across multiple models?
- GitHub Issue: Support multiple models per GPU