torch.OutOfMemoryError: CUDA out of memory

The error occurs because vLLM pre-allocates most of the GPU's memory for model weights and KV cache, and Qwen3.5-9B in BF16/FP16 precision needs more than 16GB of VRAM: at 2 bytes per parameter, the weights alone take roughly 18GB, before any KV cache. nvidia-smi shows only 1GB in use because vLLM failed to load the model and exited before allocating the full memory block, so that figure does not reflect what vLLM actually needs to run the model. The error message means PyTorch requested more memory than was free on the GPU, counting both PyTorch-managed and other CUDA allocations. Ollama works because it uses a quantized (e.g., 4-bit) version of the model, which fits in much less memory.
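As a rough back-of-the-envelope check of the numbers above, here is a minimal sketch that estimates weight memory only. It assumes the model has about 9 billion parameters (inferred from the "9B" in the name, not a verified count) and ignores KV cache, activations, and CUDA overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: "9B" means roughly 9e9 parameters.
params = 9e9

bf16_gib = weight_memory_gib(params, 16)  # BF16/FP16: 2 bytes per parameter
int4_gib = weight_memory_gib(params, 4)   # 4-bit quantization: 0.5 bytes per parameter

print(f"BF16 weights:  {bf16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the BF16 weights alone already exceed a 16GB card, while a 4-bit version leaves room for the KV cache.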

To overcome this, you can:

  • Use a quantized model (e.g., 4-bit or 8-bit) compatible with vLLM.
  • Lower the context length (--max-model-len), batch size (--max-num-seqs), or --gpu-memory-utilization.
  • Try CPU offloading, or tensor parallelism if you have multiple GPUs. With a single 16GB GPU, quantization is the most effective solution. See the "Conserving Memory" guide in the vLLM documentation for details.
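The options above can be combined in a single launch command. The sketch below only builds and prints a `vllm serve` command line; it does not start vLLM. The helper name `build_vllm_serve_command`, the model ID `your-org/your-quantized-model`, and the specific values (4096, 8, 0.85) are illustrative assumptions, not recommendations from the vLLM docs. Check the flags against your installed vLLM version.

```python
import shlex


def build_vllm_serve_command(
    model: str,
    max_model_len: int = 4096,             # shorter context -> smaller KV cache
    max_num_seqs: int = 8,                 # fewer concurrent sequences
    gpu_memory_utilization: float = 0.85,  # fraction of VRAM vLLM may reserve
    cpu_offload_gb: int = 0,               # GiB of weights to offload to CPU RAM
) -> list[str]:
    """Build an argument list for `vllm serve` with memory-saving options."""
    cmd = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if cpu_offload_gb > 0:
        cmd += ["--cpu-offload-gb", str(cpu_offload_gb)]
    return cmd


# Hypothetical model ID: substitute a quantized checkpoint that vLLM supports.
cmd = build_vllm_serve_command("your-org/your-quantized-model")
print(shlex.join(cmd))
```

Start with conservative values and raise the context length or sequence count once the model loads without running out of memory.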

Would you like a step-by-step guide on running a quantized model with vLLM?
