An issue with using multiple GPUs to deploy multiple models with vLLM

billfung · May 28, 2026, 9:24am

I have 3 GPUs in my computer:
0: Tesla V100 32G
1: Nvidia 3090 24G
2: Nvidia 3090 24G

2 models to deploy:

Using llama.cpp to deploy Qwen3.6-35B-A3B-Q4-GGUF on the V100.
Using vLLM to deploy Qwen3.6-27B-AWQ-4Bit on the two 3090s.

issue:

I ran llama.cpp first to deploy on the V100; the model deployed successfully and occupied almost 20G.
I then ran vLLM to deploy on the two 3090s, but it reports that there is not enough memory on GPU 0 (the V100), even though I explicitly set CUDA_VISIBLE_DEVICES to run on GPU 1 and GPU 2.

If I swap the order, running vLLM first and llama.cpp second, everything works.

llama.cpp script:
export CUDA_VISIBLE_DEVICES=0

nohup ./llama-server \
  -m /work/models/Qwen3.6-27B-MTP/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --alias "qwen3.6-27b" \
  --api-key "------" \
  --port 3002 \
  --host 0.0.0.0 \
  -ngl 99 \
  -c 150000 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --parallel 1 \
  --no-mmap &

************************************************
vllm script:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
#export FLASH_ATTENTION_FORCE_TRITON=1
export VLLM_USE_CUSTOM_KERNELS=1
export CUDA_LAUNCH_BLOCKING=0
export CUDA_AUTO_BOOST=0
export CUDA_VISIBLE_DEVICES=1,2
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
#VLLM_USE_MODELSCOPE=true nohup vllm serve "/work/models/Qwen36_27B_AWQ_4Bit" \

#VLLM_USE_MODELSCOPE=true nohup vllm serve "/work/models/Qwen36-35B-A3B"
CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_MODELSCOPE=true nohup vllm serve "/work/models/Qwen36_27B_AWQ_4Bit" \
  --host 0.0.0.0 \
  --port 3001 \
  --gpu-memory-utilization 0.85 \
  --served-model-name "qwen3.6-27b" \
  --max-model-len 155144 \
  --tensor-parallel-size 2 \
  --block-size 16 \
  --max-num-seqs 32 \
  --allowed-local-media-path "/home/swzx/data/" \
  --api-key ---------------------------- \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --default-chat-template-kwargs '{"enable_thinking":false}' \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":4}' \
  --limit-mm-per-prompt '{"image":30,"video":0}' &

RunLLM · May 28, 2026, 9:24am

It seems likely that vLLM is still trying to allocate memory on GPU 0 (the V100) even though CUDA_VISIBLE_DEVICES=1,2 is set, possibly because of CUDA context initialization order or a lingering process. This is not explicitly documented in the vLLM or llama.cpp guides, but similar problems have been reported when several frameworks share a machine and CUDA_VISIBLE_DEVICES is not respected by every subprocess or library, especially if a CUDA context is initialized before the variable is set or if other processes are already using the GPU.
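
One thing that may be worth ruling out first: CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0, so a "GPU 0" in a log from a process started with CUDA_VISIBLE_DEVICES=1,2 refers to the first entry of that list, not necessarily the V100. The sketch below is a hypothetical helper (logical_to_physical is not part of vLLM or llama.cpp) that maps a logical index back to the listed entry. The indices also only name the same card in both scripts if both use the same CUDA_DEVICE_ORDER; the vLLM script sets PCI_BUS_ID, while the llama.cpp script leaves it at the CUDA default of FASTEST_FIRST.

```shell
# Hypothetical helper: map a logical CUDA index (as seen inside a process)
# back to the corresponding entry of a CUDA_VISIBLE_DEVICES list.
# The body runs in a subshell, so the IFS change does not leak to the caller.
logical_to_physical() (
    visible=$1    # e.g. "1,2"
    logical=$2    # e.g. 0
    i=0
    IFS=','
    for id in $visible; do
        if [ "$i" -eq "$logical" ]; then
            printf '%s\n' "$id"
            exit 0
        fi
        i=$((i + 1))
    done
    exit 1        # logical index is outside the visible list
)

# With CUDA_VISIBLE_DEVICES=1,2 as in the vLLM script:
logical_to_physical "1,2" 0   # prints 1
logical_to_physical "1,2" 1   # prints 2
```

If the error message names device 0 while only devices 1 and 2 are visible, it may be pointing at one of the 3090s rather than the V100.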

My understanding is that running vLLM first works because it initializes its CUDA contexts on the specified GPUs before llama.cpp starts, so llama.cpp cannot interfere. If llama.cpp is started first, it may hold a CUDA context or memory on GPU 0, causing vLLM to fail even though you specify GPUs 1 and 2. Making sure that no CUDA context is initialized on GPU 0 before vLLM starts, and double-checking that no other process is using GPU 0, may help. Would you like more detail or troubleshooting steps?
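
As a starting point for that double-check, here is a minimal sketch that reads per-GPU memory from nvidia-smi and compares free memory against a fraction of the total. It assumes that vLLM needs roughly --gpu-memory-utilization × total memory to be free on each device at startup; check_headroom is a hypothetical helper name. Note that nvidia-smi lists GPUs in PCI bus order, which only matches CUDA's numbering when CUDA_DEVICE_ORDER=PCI_BUS_ID is set.

```shell
# Hypothetical helper: read CSV lines "index, name, used_mib, total_mib"
# on stdin and report whether each GPU has at least fraction * total free.
check_headroom() {
    awk -F', *' -v frac="$1" '
        {
            free = $4 - $3          # free memory in MiB
            need = frac * $4        # assumed requirement in MiB
            verdict = (free >= need) ? "ok" : "short"
            printf "gpu %s (%s): free %d MiB, need %d MiB -> %s\n", $1, $2, free, need, verdict
        }'
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,memory.used,memory.total \
               --format=csv,noheader,nounits | check_headroom 0.85
fi
```

Running this after llama.cpp is up and before vLLM starts should show which card the 20G actually landed on.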
