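vLLM is best suited to high-memory data center GPUs such as the NVIDIA A100, H100, H200, and B200, or enterprise RTX cards with 24GB+ VRAM. Consumer GPUs with 16GB VRAM (like the RTX 5060 Ti) are often insufficient for 7B+ models: 7B parameters at 16-bit precision need about 14GB for the weights alone, and even with quantization the fit can be tight, because vLLM preallocates a fixed fraction of GPU memory (--gpu-memory-utilization, 0.9 by default) and needs whatever is left after the weights for the KV cache (conserving memory guide, hardware recommendations).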
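To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes weights dominate memory use and ignores activations, the CUDA context, and quantization metadata, so treat the numbers as optimistic upper bounds. The function names are hypothetical helpers, not part of vLLM.

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return num_params_billion * bytes_per_param


def kv_cache_budget_gb(
    vram_gb: float,
    num_params_billion: float,
    bytes_per_param: float,
    gpu_memory_utilization: float = 0.9,  # assumed to match vLLM's default
) -> float:
    """Memory left for the KV cache after weights, inside vLLM's budget.

    Simplification: activations and runtime overhead are ignored.
    A negative result means the weights alone do not fit.
    """
    budget = vram_gb * gpu_memory_utilization
    return budget - weight_memory_gb(num_params_billion, bytes_per_param)


# 7B model, 16-bit weights (2 bytes per parameter), one 16GB card
fp16_left = kv_cache_budget_gb(16, 7, 2)
# Same model with 4-bit quantized weights (0.5 bytes per parameter)
int4_left = kv_cache_budget_gb(16, 7, 0.5)

print(f"FP16 weights:  {weight_memory_gb(7, 2):.1f} GB, KV cache budget: {fp16_left:.1f} GB")
print(f"4-bit weights: {weight_memory_gb(7, 0.5):.1f} GB, KV cache budget: {int4_left:.1f} GB")
```

Under these assumptions the 16-bit model leaves well under 1GB for the KV cache, which is why it tends to fail on a 16GB card. A 4-bit model leaves more headroom on paper, though real overhead and long context lengths can still eat into it.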
Adding a second RTX 5060 Ti (16GB) and using --tensor-parallel-size 2 can help: the model is sharded across both GPUs, so each card holds roughly half of the weights and the combined memory available for weights and KV cache is roughly doubled. However, this only works if the model supports tensor parallelism (for example, its number of attention heads must be divisible by the tensor-parallel size) and both GPUs are visible to vLLM and compatible (parallelism docs). Some users have reported issues with tensor parallelism on consumer cards, so results may vary.
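As a minimal sketch of what the two-GPU launch might look like, the snippet below only assembles and prints a `vllm serve` command; it does not run it. The model name is a placeholder, and the values for `--gpu-memory-utilization` and `--max-model-len` are illustrative assumptions that you would tune for your own setup.

```python
import shlex


def build_serve_command(
    model: str,
    tensor_parallel_size: int = 2,
    gpu_memory_utilization: float = 0.9,  # illustrative value
    max_model_len: int = 4096,  # illustrative; a shorter context needs less KV cache
) -> list[str]:
    """Assemble a `vllm serve` command line as a list of arguments."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]


# "your-org/your-7b-model" is a placeholder, not a real model ID
cmd = build_serve_command("your-org/your-7b-model")
print(shlex.join(cmd))
```

Capping the context length is one common way to reduce KV cache pressure if the server still runs out of memory at startup.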
Would you like a step-by-step guide for multi-GPU setup?