My model is relatively small-scale and doesn't require large tensor parallelism (TP) for inference (TP=2 is enough for this model). However, my hardware instance has 8 GPUs. How can I configure multiple executors to utilize the full GPU capacity while exposing only a single vLLM service?
Specifically, I want to group GPUs as [gpu:0,1: executor_0], [gpu:2,3: executor_1], etc., with a single LLM engine managing these executor groups, which would make it easier to batch larger volumes of requests.
It seems likely that you can achieve this by running multiple vLLM instances, each using a subset of the available GPUs, and then placing a load balancer in front of them to distribute requests. According to the vLLM documentation, you can control which GPUs each instance uses by setting the CUDA_VISIBLE_DEVICES environment variable before launching it. For example, start one instance with CUDA_VISIBLE_DEVICES=0,1 and --tensor-parallel-size 2, another with CUDA_VISIBLE_DEVICES=2,3, and so on for the remaining GPU pairs.
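As a rough illustration, here is a minimal Python sketch that launches four TP=2 OpenAI-compatible servers, one per GPU pair, on consecutive ports. It assumes a recent vLLM release that provides the `vllm serve` CLI; the model name and ports are placeholders you would replace with your own.

```python
# Hypothetical sketch: launch four TP=2 vLLM servers, one per GPU pair.
# MODEL and BASE_PORT are placeholders -- adjust to your setup.
import os
import subprocess

MODEL = "my-org/my-small-model"  # placeholder model name
BASE_PORT = 8001                 # assumes ports 8001-8004 are free

processes = []
for i, gpus in enumerate(["0,1", "2,3", "4,5", "6,7"]):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpus  # pin this instance to one GPU pair
    cmd = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", "2",
        "--port", str(BASE_PORT + i),
    ]
    processes.append(subprocess.Popen(cmd, env=env))

# Keep the launcher alive until the servers exit (Ctrl+C stops them all).
try:
    for p in processes:
        p.wait()
except KeyboardInterrupt:
    for p in processes:
        p.terminate()
```

You could equally well start the four servers from a shell script or a process manager; the key point is that each process sees only its own two GPUs via CUDA_VISIBLE_DEVICES.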
To expose a single vLLM service, you can put a load balancer or a lightweight proxy server in front of these instances to distribute incoming requests among them. This setup lets you utilize all eight GPUs while keeping a single service endpoint; a minimal proxy sketch follows below. Would you like more detailed guidance on setting up the load balancer or proxy server?
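For illustration only, here is a minimal round-robin proxy sketch, assuming FastAPI and httpx are installed and the backends from the previous snippet are listening on ports 8001-8004. It does not handle streaming responses, retries, or health checks; a production deployment would more likely use Nginx, HAProxy, or a Kubernetes Service in front of the instances.

```python
# Hypothetical round-robin proxy in front of the four vLLM servers above.
# Backend URLs are assumptions based on the earlier sketch.
import itertools

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

BACKENDS = [f"http://localhost:{port}" for port in range(8001, 8005)]
rr = itertools.cycle(BACKENDS)  # simple round-robin backend selection

app = FastAPI()
client = httpx.AsyncClient(timeout=None)

@app.post("/{path:path}")
async def proxy(path: str, request: Request):
    backend = next(rr)
    body = await request.json()
    # Forward the OpenAI-compatible request unchanged to the chosen backend,
    # e.g. POST /v1/chat/completions -> http://localhost:800X/v1/chat/completions
    resp = await client.post(f"{backend}/{path}", json=body)
    return JSONResponse(content=resp.json(), status_code=resp.status_code)

# Run with: uvicorn proxy:app --port 8000
# Clients then talk to port 8000 as if it were a single vLLM server.
```

Note that plain round-robin ignores per-backend load; if your request sizes vary a lot, a least-connections policy in a real load balancer will spread work more evenly.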