My model is relatively small-scale and doesn't require large tensor parallelism (TP) for inference (TP=2 is enough for this model). However, my hardware instance has 8 GPUs. How can I configure multiple executors to utilize the full GPU capacity while exposing only a single vLLM service?
Specifically, I want to group GPUs as [gpu:0,1: executor_0], [gpu:2,3: executor_1], etc., with a single LLM engine managing these executor groups, which would make it easier to batch larger volumes of requests.
It seems likely that you can achieve this by running multiple vLLM instances, each using a subset of the available GPUs, and then placing a load balancer in front of them to distribute requests. According to the vLLM documentation, you can control which GPUs each instance uses by setting the CUDA_VISIBLE_DEVICES environment variable before launching it. For example, start one instance with CUDA_VISIBLE_DEVICES=0,1 and --tensor-parallel-size 2, another with CUDA_VISIBLE_DEVICES=2,3, and so on for the remaining GPU pairs.
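As a rough illustration, here is a minimal Python sketch that launches four TP=2 OpenAI-compatible servers, one per GPU pair, on consecutive ports. It assumes a recent vLLM release that provides the `vllm serve` CLI; the model name and ports are placeholders you would replace with your own.

```python
# Hypothetical sketch: launch four TP=2 vLLM servers, one per GPU pair.
# MODEL and BASE_PORT are placeholders -- adjust to your setup.
import os
import subprocess

MODEL = "my-org/my-small-model"  # placeholder model name
BASE_PORT = 8001                 # assumes ports 8001-8004 are free

processes = []
for i, gpus in enumerate(["0,1", "2,3", "4,5", "6,7"]):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpus  # pin this instance to one GPU pair
    cmd = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", "2",
        "--port", str(BASE_PORT + i),
    ]
    processes.append(subprocess.Popen(cmd, env=env))

# Keep the launcher alive until the servers exit (Ctrl+C stops them all).
try:
    for p in processes:
        p.wait()
except KeyboardInterrupt:
    for p in processes:
        p.terminate()
```

You could equally well start the four servers from a shell script or a process manager; the key point is that each process sees only its own two GPUs via CUDA_VISIBLE_DEVICES.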
To expose a single vLLM service, you can put a load balancer or a lightweight proxy server in front of these instances to distribute incoming requests among them. This setup lets you utilize all eight GPUs while keeping a single service endpoint; a minimal proxy sketch follows below. Would you like more detailed guidance on setting up the load balancer or proxy server?
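For illustration only, here is a minimal round-robin proxy sketch, assuming FastAPI and httpx are installed and the backends from the previous snippet are listening on ports 8001-8004. It does not handle streaming responses, retries, or health checks; a production deployment would more likely use Nginx, HAProxy, or a Kubernetes Service in front of the instances.

```python
# Hypothetical round-robin proxy in front of the four vLLM servers above.
# Backend URLs are assumptions based on the earlier sketch.
import itertools

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

BACKENDS = [f"http://localhost:{port}" for port in range(8001, 8005)]
rr = itertools.cycle(BACKENDS)  # simple round-robin backend selection

app = FastAPI()
client = httpx.AsyncClient(timeout=None)

@app.post("/{path:path}")
async def proxy(path: str, request: Request):
    backend = next(rr)
    body = await request.json()
    # Forward the OpenAI-compatible request unchanged to the chosen backend,
    # e.g. POST /v1/chat/completions -> http://localhost:800X/v1/chat/completions
    resp = await client.post(f"{backend}/{path}", json=body)
    return JSONResponse(content=resp.json(), status_code=resp.status_code)

# Run with: uvicorn proxy:app --port 8000
# Clients then talk to port 8000 as if it were a single vLLM server.
```

Note that plain round-robin ignores per-backend load; if your request sizes vary a lot, a least-connections policy in a real load balancer will spread work more evenly.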