torch.OutOfMemoryError: CUDA out of memory

RunLLM · March 3, 2026, 6:03am

vLLM is best suited to high-memory data center GPUs such as the NVIDIA A100, H100, H200, or B200, or to enterprise RTX cards with 24GB+ VRAM. Consumer GPUs with 16GB VRAM (like the RTX 5060 Ti) are often insufficient for 7B+ models, even with quantization, because vLLM reserves a large fraction of GPU memory up front for model weights and the KV cache (conserving memory guide, hardware recommendations).
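As a rough illustration of why 16GB is tight, here is a back-of-the-envelope sketch in Python. It assumes the per-GPU budget is simply total VRAM times vLLM's `gpu_memory_utilization` fraction (0.9 by default), that weights take `parameters × bytes per parameter`, and that tensor parallelism divides the weights evenly across GPUs. The function name and the example numbers are illustrative only; activations, CUDA graphs, and memory already used by the desktop are ignored, so real headroom will be smaller.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimate_headroom_gib(params_billions, bytes_per_param, vram_gib,
                          gpu_memory_utilization=0.9, num_gpus=1):
    """Rough per-GPU estimate: (weights, budget, headroom), all in GiB.

    Assumption: weights are split evenly across GPUs and nothing else
    (activations, CUDA graphs, other processes) competes for memory.
    """
    weights_bytes = params_billions * 1e9 * bytes_per_param
    weights_gib = weights_bytes / GIB / num_gpus
    budget_gib = vram_gib * gpu_memory_utilization
    headroom_gib = budget_gib - weights_gib  # what is left for the KV cache
    return weights_gib, budget_gib, headroom_gib


if __name__ == "__main__":
    # Hypothetical case: 7B parameters in fp16 (2 bytes each) on 16 GiB cards.
    for gpus in (1, 2):
        w, b, h = estimate_headroom_gib(7, 2, 16, num_gpus=gpus)
        print(f"{gpus} GPU(s): weights {w:.2f} GiB, budget {b:.2f} GiB, "
              f"KV-cache headroom {h:.2f} GiB per GPU")
```

Under these assumptions, a single card leaves only a little over 1 GiB for the KV cache, which is why long context lengths or large batches fail at startup.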

Adding a second RTX 5060 Ti (16GB) and running with --tensor-parallel-size 2 can help: the model is sharded across both GPUs, roughly doubling the memory available for weights and KV cache. This only works if the model supports tensor parallelism and both GPUs are visible to vLLM and compatible (parallelism docs). Some users have reported issues with tensor parallelism on consumer cards, so results may vary.
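A minimal sketch of what such a launch could look like, written as a small Python helper that only assembles the `vllm serve` command line (it does not start the server). The model name is a placeholder, the helper name is hypothetical, and the `--max-model-len` value is an arbitrary example for capping context length to save KV-cache memory.

```python
import shlex


def build_serve_command(model, tensor_parallel_size=2,
                        gpu_memory_utilization=0.9, max_model_len=None):
    """Assemble a `vllm serve` argument list for a multi-GPU launch."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if max_model_len is not None:
        # A shorter context window needs less KV-cache memory.
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


if __name__ == "__main__":
    # "your-org/your-7b-model" is a placeholder, not a real checkpoint.
    cmd = build_serve_command("your-org/your-7b-model", max_model_len=8192)
    print(shlex.join(cmd))
```

Printing the command first makes it easy to check the flags, and to confirm with `nvidia-smi` that both cards are visible, before actually running it.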

Would you like a step-by-step guide for multi-GPU setup?
