I have 3 GPUs in my computer:
0: Tesla V100 32G
1: Nvidia 3090 24G
2: Nvidia 3090 24G
I want to deploy 2 models:
- Qwen3.6-35B-A3B-Q4-GGUF on the V100, using llama.cpp.
- Qwen3.6-27B-AWQ-4Bit on the two 3090s, using vLLM.
Issue:
- I ran llama.cpp first to deploy on the V100. The model loaded successfully and occupied almost 20G.
- I then ran vLLM to deploy on the two 3090s. It reports that there is not enough memory on GPU 0 (the V100), even though I explicitly set CUDA_VISIBLE_DEVICES so that it runs on GPU 1 and GPU 2.
If I reverse the order, starting vLLM first and llama.cpp second, everything works.
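One thing worth ruling out is a device-numbering mismatch. CUDA enumerates GPUs "fastest first" by default, while nvidia-smi lists them in PCI bus order, and the two only agree when CUDA_DEVICE_ORDER=PCI_BUS_ID is set. The vLLM script below sets it, but the llama.cpp script does not. The sketch below is only a way to look up indices by GPU name instead of assuming them; `gpu_indices_by_name` is a hypothetical helper, not part of either tool, and this is not a confirmed diagnosis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the comma-separated indices of all GPUs whose
# name contains the substring given as $1.
# Input on stdin: "index, name" lines, as produced by
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
gpu_indices_by_name() {
    awk -F', ' -v pat="$1" '
        index($2, pat) { out = out sep $1; sep = "," }
        END { print out }
    '
}

# Intended use on the real machine (indices are in PCI bus order, so
# CUDA_DEVICE_ORDER must be set for them to mean the same thing to CUDA):
#   export CUDA_DEVICE_ORDER=PCI_BUS_ID
#   export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,name \
#       --format=csv,noheader | gpu_indices_by_name 3090)
```

Setting CUDA_DEVICE_ORDER=PCI_BUS_ID in both launch scripts would at least guarantee that index 0 refers to the same card in each process.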
llama.cpp script:
export CUDA_VISIBLE_DEVICES=0
nohup ./llama-server \
  -m /work/models/Qwen3.6-27B-MTP/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --alias "qwen3.6-27b" \
  --api-key "------" \
  --port 3002 \
  --host 0.0.0.0 \
  -ngl 99 \
  -c 150000 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --parallel 1 \
  --no-mmap &
************************************************
vllm script:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
#export FLASH_ATTENTION_FORCE_TRITON=1
export VLLM_USE_CUSTOM_KERNELS=1
export CUDA_LAUNCH_BLOCKING=0
export CUDA_AUTO_BOOST=0
export CUDA_VISIBLE_DEVICES=1,2
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
#VLLM_USE_MODELSCOPE=true nohup vllm serve "/work/models/Qwen36_27B_AWQ_4Bit" \
#VLLM_USE_MODELSCOPE=true nohup vllm serve "/work/models/Qwen36-35B-A3B"
CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_MODELSCOPE=true nohup vllm serve "/work/models/Qwen36_27B_AWQ_4Bit" \
  --host 0.0.0.0 \
  --port 3001 \
  --gpu-memory-utilization 0.85 \
  --served-model-name "qwen3.6-27b" \
  --max-model-len 155144 \
  --tensor-parallel-size 2 \
  --block-size 16 \
  --max-num-seqs 32 \
  --allowed-local-media-path "/home/swzx/data/" \
  --api-key ---------------------------- \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --default-chat-template-kwargs '{"enable_thinking":false}' \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":4}' \
  --limit-mm-per-prompt '{"image":30,"video":0}' &
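To see which cards would fail a memory check before launching vLLM, a small pre-flight test along these lines might help. It assumes vLLM refuses to start on a device whose free memory is below --gpu-memory-utilization times its total memory; that is my understanding of recent versions, not something confirmed from the log here. `gpus_below_budget` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the index of every GPU whose free memory is
# below the fraction $1 of its total memory.
# Input on stdin: "index, memory.used, memory.total" lines in MiB, as from
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#       --format=csv,noheader,nounits
gpus_below_budget() {
    awk -F', ' -v util="$1" '
        {
            free = $3 - $2                    # MiB still available
            if (free < util * $3) print $1    # cannot satisfy the budget
        }
    '
}

# Intended use on the real machine, with the same fraction as the vLLM flag:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#       --format=csv,noheader,nounits | gpus_below_budget 0.85
```

With about 20G of the V100's 32G already taken by llama.cpp, GPU 0 would be flagged at 0.85 while the idle 3090s would not. That matches the error I get, if vLLM is in fact touching GPU 0.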