Running the qwen3-14B model on a 3090 with 24 GB of VRAM hits the error below; setting gpu_memory_utilization anywhere from 0.3 to 0.6 doesn't help.
Model initialization code:
llm = LLM(model=MODEL_NAME, gpu_memory_utilization=0.3)
Error:
(EngineCore_DP0 pid=2178) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 340.00 MiB. GPU 0 has a total capacity of 23.56 GiB of which 31.06 MiB is free. Process 2209924 has 23.52 GiB memory in use. Of the allocated memory 23.19 GiB is allocated by PyTorch, and 18.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Device info:
NVIDIA GeForce RTX 3090
GPU memory: 24 GB
System RAM: 86 GB
A 3090 with 24 GB of VRAM cannot load the qwen3-14B model, mainly because the model weights alone (14B parameters, roughly 28 GB in FP16) already exceed single-card memory, so even setting gpu_memory_utilization to 0.3-0.6 cannot prevent the OOM. In vLLM, the gpu_memory_utilization parameter only bounds the KV cache and other runtime buffers; the model weights themselves must still fit, and a 24 GB 3090 is not enough. Recommended options: use a GPU with more VRAM, shard the model across multiple cards with tensor parallelism (e.g. two 24 GB cards with tensor_parallel_size=2), or choose a smaller model. To save further memory you can try a quantized model (int4, int8) or lower parameters such as max_model_len and max_num_seqs, but the root cause is insufficient single-card VRAM, which cannot be fixed by parameter tuning alone (see the official documentation and similar issue discussions).
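The weight-memory estimate above can be checked with a few lines of back-of-envelope Python (illustrative only; it ignores activations, the KV cache, and CUDA context overhead, which all add to the total):

```python
# Rough FP16 weight footprint for a 14B-parameter model.
params = 14e9          # 14 billion parameters
bytes_per_param = 2    # FP16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1024**3
print(f"~{weights_gb:.1f} GB of weights")        # → ~26.1 GB of weights
print("fits on one 24 GB card:", weights_gb < 24)  # → False
```

Even before vLLM allocates a single KV-cache block, the weights alone overshoot the card, which is why the error shows 23.19 GiB already allocated by PyTorch.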
For multi-GPU deployment, add the parameter tensor_parallel_size=2 and make sure CUDA_VISIBLE_DEVICES exposes two cards. If you must stay on a single card, switch to a 7B or smaller model, or use an int4-quantized build (e.g. Qwen3-14B-GPTQ-Int4), at some cost in inference accuracy. Do you need detailed step-by-step instructions?
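A minimal sketch of the two workarounds, assuming two 24 GB cards for option A and that the quantized repo name (taken from the suggestion above) resolves on the Hugging Face Hub; treat the exact model names and parameter values as placeholders for your setup:

```python
from vllm import LLM

# Option A: shard the weights across two 24 GB cards.
# Launch with: CUDA_VISIBLE_DEVICES=0,1 python serve.py
llm = LLM(
    model="Qwen/Qwen3-14B",
    tensor_parallel_size=2,      # split weights across 2 GPUs
    gpu_memory_utilization=0.9,  # now bounds KV cache, not weights
    max_model_len=8192,          # cap context to limit KV-cache size
)

# Option B: single card with an int4-quantized build (lower accuracy).
# llm = LLM(model="Qwen/Qwen3-14B-GPTQ-Int4", gpu_memory_utilization=0.9)
```

This is a configuration fragment that needs the GPUs described above to run; the key point is that tensor_parallel_size changes where the weights live, while gpu_memory_utilization only changes how much of the remaining VRAM vLLM reserves.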
OK, I understand how to solve the problem now, thanks!