Running the qwen3-14B model on a 3090 with 24 GB of VRAM hits the error below; setting gpu_memory_utilization anywhere from 0.3 to 0.6 doesn't help.
Model initialization code:
llm = LLM(model=MODEL_NAME, gpu_memory_utilization=0.3)
Error:
(EngineCore_DP0 pid=2178) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 340.00 MiB. GPU 0 has a total capacity of 23.56 GiB of which 31.06 MiB is free. Process 2209924 has 23.52 GiB memory in use. Of the allocated memory 23.19 GiB is allocated by PyTorch, and 18.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Device info:
NVIDIA GeForce RTX 3090
GPU memory: 24 GB
System RAM: 86 GB
A 3090 with 24 GB of VRAM cannot load the qwen3-14B model, mainly because the model weights alone (14B parameters, roughly 28 GB in FP16) already exceed single-card memory, so even setting gpu_memory_utilization to 0.3-0.6 cannot prevent the OOM. In vLLM, the gpu_memory_utilization parameter only bounds the KV cache and other runtime buffers; the model weights themselves must still fit, and a 24 GB 3090 is not enough. Recommended options: use a GPU with more VRAM, shard the model across multiple cards with tensor parallelism (e.g. two 24 GB cards with tensor_parallel_size=2), or choose a smaller model. To save further memory you can try a quantized model (int4, int8) or lower parameters such as max_model_len and max_num_seqs, but the root cause is insufficient single-card VRAM, which cannot be fixed by parameter tuning alone (see the official documentation and similar issue discussions).
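The weight-memory estimate above can be checked with a few lines of back-of-envelope Python (illustrative only; it ignores activations, the KV cache, and CUDA context overhead, which all add to the total):

```python
# Rough FP16 weight footprint for a 14B-parameter model.
params = 14e9          # 14 billion parameters
bytes_per_param = 2    # FP16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1024**3
print(f"~{weights_gb:.1f} GB of weights")        # → ~26.1 GB of weights
print("fits on one 24 GB card:", weights_gb < 24)  # → False
```

Even before vLLM allocates a single KV-cache block, the weights alone overshoot the card, which is why the error shows 23.19 GiB already allocated by PyTorch.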
For multi-GPU deployment, add the parameter tensor_parallel_size=2 and make sure CUDA_VISIBLE_DEVICES exposes two cards. If you must stay on a single card, switch to a 7B or smaller model, or use an int4-quantized build (e.g. Qwen3-14B-GPTQ-Int4), at some cost in inference accuracy. Do you need detailed step-by-step instructions?
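A minimal sketch of the two workarounds, assuming two 24 GB cards for option A and that the quantized repo name (taken from the suggestion above) resolves on the Hugging Face Hub; treat the exact model names and parameter values as placeholders for your setup:

```python
from vllm import LLM

# Option A: shard the weights across two 24 GB cards.
# Launch with: CUDA_VISIBLE_DEVICES=0,1 python serve.py
llm = LLM(
    model="Qwen/Qwen3-14B",
    tensor_parallel_size=2,      # split weights across 2 GPUs
    gpu_memory_utilization=0.9,  # now bounds KV cache, not weights
    max_model_len=8192,          # cap context to limit KV-cache size
)

# Option B: single card with an int4-quantized build (lower accuracy).
# llm = LLM(model="Qwen/Qwen3-14B-GPTQ-Int4", gpu_memory_utilization=0.9)
```

This is a configuration fragment that needs the GPUs described above to run; the key point is that tensor_parallel_size changes where the weights live, while gpu_memory_utilization only changes how much of the remaining VRAM vLLM reserves.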
OK, I understand how to solve the problem now, thanks!