I am trying to do offline inference with two LLMs in one Python script. I have 8 GPUs and would like to put the two LLMs on different devices, for example, the first LLM on GPUs 0–3 and the second on GPUs 4–7. Since both LLMs live in the same Python script, I assume setting os.environ['CUDA_VISIBLE_DEVICES'] won't work. How should I achieve this? The snippet should look like:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer1 = AutoTokenizer.from_pretrained(...)
llm1 = LLM(
    model=...,
    tensor_parallel_size=args.tp,
    pipeline_parallel_size=args.pp,
    trust_remote_code=True,
    # some argument to set the device ids
)
params1 = SamplingParams(...)

tokenizer2 = AutoTokenizer.from_pretrained(...)
llm2 = LLM(
    model=...,
    tensor_parallel_size=args.tp,
    pipeline_parallel_size=args.pp,
    trust_remote_code=True,
    # some argument to set the device ids
)
params2 = SamplingParams(...)
# do something with LLM1 and LLM2...
vLLM does not natively support assigning different LLM instances to specific GPU sets within the same Python process. The recommended approach is to launch each LLM in a separate process and set CUDA_VISIBLE_DEVICES before each process initializes CUDA, so each LLM instance only sees its assigned GPUs. There is no argument in the LLM constructor to set device IDs per model within a single process; attempts to do so (e.g., via a device argument) are unsupported and may lead to errors or undefined behavior [source1][source2][source3][source4].
To achieve your goal, run two separate scripts or subprocesses, each with its own CUDA_VISIBLE_DEVICES value (e.g., CUDA_VISIBLE_DEVICES=0,1,2,3 for the first, CUDA_VISIBLE_DEVICES=4,5,6,7 for the second), and instantiate one LLM per process. This is the only robust, officially supported method in vLLM for assigning multiple models to disjoint GPU sets [source1][source2][source3][source4]. A sketch of this pattern is shown below.
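Here is a minimal sketch using Python's multiprocessing with the "spawn" start method. The model names ("model-one", "model-two"), prompts, sampling parameters, and the tensor-parallel size of 4 are placeholders, not taken from your setup; adapt them to your models. The key point is that each worker sets CUDA_VISIBLE_DEVICES first and only then imports vLLM and constructs the LLM, so CUDA is initialized against the restricted device set in that process.

import multiprocessing as mp
import os


def run_llm(gpu_ids: str, model: str, prompts: list[str], tp: int):
    # Pin this worker to its GPU subset *before* vLLM/PyTorch touch CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids

    # Import inside the worker so CUDA initialization happens after the env var is set.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, tensor_parallel_size=tp, trust_remote_code=True)
    params = SamplingParams(max_tokens=128)
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(f"[GPUs {gpu_ids}] {out.outputs[0].text}")


if __name__ == "__main__":
    # "spawn" gives each worker a fresh interpreter with no inherited CUDA state.
    ctx = mp.get_context("spawn")
    p1 = ctx.Process(target=run_llm, args=("0,1,2,3", "model-one", ["Hello"], 4))
    p2 = ctx.Process(target=run_llm, args=("4,5,6,7", "model-two", ["Hello"], 4))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

If you prefer fully separate scripts, launching each one with subprocess and passing CUDA_VISIBLE_DEVICES through the env argument (or on the command line, e.g. CUDA_VISIBLE_DEVICES=4,5,6,7 python infer2.py) achieves the same isolation.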
Sources: