I am trying to do offline inference with two LLMs in one Python script. I have 8 GPUs and would like to put the two LLMs on different devices, for example, the first LLM on GPUs 0–3 and the second on GPUs 4–7. Since both LLMs live in the same Python script, I assume setting os.environ['CUDA_VISIBLE_DEVICES'] won't work. How should I achieve this? The snippet should look like:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer1 = AutoTokenizer.from_pretrained(...)
llm1 = LLM(
    model=...,
    tensor_parallel_size=args.tp,
    pipeline_parallel_size=args.pp,
    trust_remote_code=True,
    # some argument to set the device ids
)
params1 = SamplingParams(...)

tokenizer2 = AutoTokenizer.from_pretrained(...)
llm2 = LLM(
    model=...,
    tensor_parallel_size=args.tp,
    pipeline_parallel_size=args.pp,
    trust_remote_code=True,
    # some argument to set the device ids
)
params2 = SamplingParams(...)
# do something with LLM1 and LLM2...
vLLM does not natively support assigning different LLM instances to specific GPU sets within the same Python process. The recommended approach is to launch each LLM in a separate process and set CUDA_VISIBLE_DEVICES before each process initializes CUDA, so each LLM instance only sees its assigned GPUs. There is no argument in the LLM constructor to set device IDs per model within a single process; attempts to do so (e.g., via a device argument) are unsupported and may lead to errors or undefined behavior [source1][source2][source3][source4].
To achieve your goal, run two separate scripts or subprocesses, each with its own CUDA_VISIBLE_DEVICES value (e.g., CUDA_VISIBLE_DEVICES=0,1,2,3 for the first, CUDA_VISIBLE_DEVICES=4,5,6,7 for the second), and instantiate one LLM per process. This is the only robust, officially supported method in vLLM for assigning multiple models to disjoint GPU sets [source1][source2][source3][source4]. A sketch of this pattern is shown below.
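Here is a minimal sketch using Python's multiprocessing with the "spawn" start method. The model names ("model-one", "model-two"), prompts, sampling parameters, and the tensor-parallel size of 4 are placeholders, not taken from your setup; adapt them to your models. The key point is that each worker sets CUDA_VISIBLE_DEVICES first and only then imports vLLM and constructs the LLM, so CUDA is initialized against the restricted device set in that process.

import multiprocessing as mp
import os


def run_llm(gpu_ids: str, model: str, prompts: list[str], tp: int):
    # Pin this worker to its GPU subset *before* vLLM/PyTorch touch CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids

    # Import inside the worker so CUDA initialization happens after the env var is set.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, tensor_parallel_size=tp, trust_remote_code=True)
    params = SamplingParams(max_tokens=128)
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(f"[GPUs {gpu_ids}] {out.outputs[0].text}")


if __name__ == "__main__":
    # "spawn" gives each worker a fresh interpreter with no inherited CUDA state.
    ctx = mp.get_context("spawn")
    p1 = ctx.Process(target=run_llm, args=("0,1,2,3", "model-one", ["Hello"], 4))
    p2 = ctx.Process(target=run_llm, args=("4,5,6,7", "model-two", ["Hello"], 4))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

If you prefer fully separate scripts, launching each one with subprocess and passing CUDA_VISIBLE_DEVICES through the env argument (or on the command line, e.g. CUDA_VISIBLE_DEVICES=4,5,6,7 python infer2.py) achieves the same isolation.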
Sources: