torch.OutOfMemoryError: CUDA out of memory

RunLLM · March 3, 2026, 3:58am

The error occurs because vLLM pre-allocates most GPU memory for model weights and KV cache, and the Qwen3.5-9B model in BF16/FP16 precision needs more than 16GB of VRAM: at 2 bytes per parameter, the weights alone come to roughly 18GB, before any KV cache. The nvidia-smi output shows only 1GB used because vLLM failed to load the model and exited before allocating the full memory block, so that figure does not reflect what vLLM actually needs to run the model. The error message means PyTorch tried to allocate more memory than was free on the GPU, counting both PyTorch-managed and other CUDA allocations. Ollama works because it uses a quantized (e.g., 4-bit) version of the model, which fits in less memory.
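To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes a nominal 9 billion parameters, counts weights only (KV cache, activations, and CUDA context come on top), and uses 0.9 as the fraction of VRAM vLLM may claim, which I believe is the default for --gpu-memory-utilization. The helper names are made up for illustration, and real quantized checkpoints are usually somewhat larger than the ideal sizes shown because some layers and scaling factors stay in higher precision.

```python
# Back-of-the-envelope weight-memory estimate (weights only, no KV cache).
# Assumption: a nominal 9e9 parameters; helper names are illustrative.

BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(num_params: float, dtype: str, vram_gb: float,
                utilization: float = 0.9) -> bool:
    """True if the weights alone fit in the share of VRAM vLLM may use."""
    return weight_memory_gb(num_params, dtype) < vram_gb * utilization


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gb(9e9, dtype)
        ok = weights_fit(9e9, dtype, vram_gb=16.0)
        print(f"{dtype}: ~{size:.1f} GB of weights, fits on a 16 GB card: {ok}")
```

Under these assumptions the BF16/FP16 weights do not fit on a 16GB card even before the KV cache is counted, while an 8-bit or 4-bit variant leaves headroom.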

To overcome this, you can:

Use a quantized model (e.g., 4-bit or 8-bit) compatible with vLLM.
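Lower the context length (--max-model-len), the maximum number of concurrent sequences (--max-num-seqs), or --gpu-memory-utilization. These settings shrink the KV cache and other reserved memory; they do not reduce the size of the weights.
Try CPU offloading (--cpu-offload-gb), or tensor parallelism (--tensor-parallel-size) if you have multiple GPUs. With a single 16GB GPU, quantization is the most effective solution. See the conserving memory guide for details.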
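As a sketch of how the options above combine, the snippet below assembles a `vllm serve` command line from the flags mentioned in this post. The model ID is a hypothetical placeholder (substitute a quantized checkpoint you have verified works with vLLM), the numeric values are illustrative starting points rather than recommendations, and the helper function is made up for this example. vLLM can often detect the quantization method from the checkpoint config, so --quantization may be unnecessary.

```python
import shlex


def build_vllm_serve_cmd(model, *, max_model_len=None, max_num_seqs=None,
                         gpu_memory_utilization=None, quantization=None,
                         cpu_offload_gb=None, tensor_parallel_size=None):
    """Assemble a `vllm serve` argument list, skipping options left as None."""
    options = {
        "--max-model-len": max_model_len,
        "--max-num-seqs": max_num_seqs,
        "--gpu-memory-utilization": gpu_memory_utilization,
        "--quantization": quantization,
        "--cpu-offload-gb": cpu_offload_gb,
        "--tensor-parallel-size": tensor_parallel_size,
    }
    cmd = ["vllm", "serve", model]
    for flag, value in options.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_vllm_serve_cmd(
        "your-org/Qwen3.5-9B-AWQ",    # hypothetical placeholder model ID
        max_model_len=8192,           # shorter context means a smaller KV cache
        max_num_seqs=16,              # fewer concurrent sequences
        gpu_memory_utilization=0.85,  # leave headroom for other CUDA users
        quantization="awq",           # assumes an AWQ checkpoint
    )
    print(shlex.join(cmd))  # copy the printed line into a shell to run it
```

Building the list first and printing it keeps the example runnable on a machine without a GPU; pass the list to subprocess.run if you want to launch the server from Python.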

Would you like a step-by-step guide on running a quantized model with vLLM?

