I have heard numerous times that vLLM is the way to go, especially for dual GPU setups. However, I am really struggling to set this up, and I'm not sure why.
Hardware:
CPU: AMD EPYC 7402p
RAM: 256GB Octo-Channel DDR4
GPU0: Nvidia RTX 2070 (8GB)
GPU1: Nvidia RTX 4060Ti (8GB)
Model Storage: 1TB Crucial P5 NVMe PCIe Gen3 (dedicated to AI models)
Software:
OS: Unraid 7.1.2
Linux Kernel: 6.12.24-Unraid
Install Method: Docker vllm/vllm-openai:latest
Nvidia Driver Version: 575.51.02
LLM GUI: Open WebUI v0.6.12
Issues:
Regardless of the model used, I get an error about a missing chat template. I have even tried downloading templates manually and specifying the path in the command.
The few times I did get vLLM to work, the performance was actually worse than Ollama on a single AMD WX9100, and that's a pretty old GPU.
I'm not 100% sure where to go from here. I have tried digging through the docs, and even when I do get the CUDA OOM errors to stop, I still get the chat template error.
vLLM requires a chat template for chat-style models; most base models do not include one, so you must provide it yourself with the --chat-template argument. A “missing chat template” error means the model lacks a built-in template and vLLM cannot format chat prompts automatically, which is expected for non-instruction-tuned models. You can find or create a chat template (e.g., from Hugging Face or the vLLM examples) and pass its path with --chat-template; see the vLLM docs for example usage.
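As a rough sketch of what that looks like with the vllm/vllm-openai Docker image (the host mount path, model directory, and template filename below are illustrative placeholders, not paths from this setup):

```bash
# Minimal sketch: vLLM OpenAI-compatible server with an explicit chat template.
# The mounted host path, model directory, and template filename are placeholders.
docker run --runtime=nvidia --gpus all \
  -v /mnt/user/ai-models:/models \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /models/my-base-model \
  --chat-template /models/templates/chat_template.jinja
```

The template is a Jinja file; anything passed after the image name goes straight to the vLLM server as engine arguments.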
Performance issues may come down to model choice, quantization, or suboptimal parallelism. Also, vLLM's multi-GPU support works best with identical GPUs; mixing a 2070 and a 4060 Ti can cause imbalances or errors. For CUDA OOM, try reducing --max-model-len or --gpu-memory-utilization, and make sure the model fits within your combined VRAM. For more help, see the vLLM troubleshooting guide.
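Here is a hedged sketch of a more conservative launch for two 8GB cards; the model name and numeric values are assumptions chosen to illustrate the knobs, not settings verified on this hardware:

```bash
# Conservative sketch for two mismatched 8GB GPUs (values are assumptions, tune as needed):
#   --tensor-parallel-size 2    split the model across both GPUs (the slower 2070 sets the pace)
#   --dtype half                Turing (RTX 2070) has no bfloat16 support
#   --max-model-len 4096        shorter context means a smaller KV cache and fewer OOMs
#   --gpu-memory-utilization    fraction of each GPU's VRAM vLLM reserves; lower it for headroom
#   --enforce-eager             skip CUDA graph capture to save some VRAM (costs a little speed)
docker run --runtime=nvidia --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --tensor-parallel-size 2 \
  --dtype half \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager
```

The model shown is just a placeholder for a quantized ~7B instruct model that fits in roughly 16GB total; swap in whatever you actually want to serve.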
Would you like step-by-step instructions for setting up a chat template or optimizing multi-GPU performance?
I think I figured out the chat template issue, but on performance and stability I'm completely lost as to what would work best with my setup.
The part that makes it the most confusing is that Ollama on my WX9100, a 16GB card from 2017, could run Gemma3:27B… fine. It was usable, but far from fast. Now, with dual GPUs that are orders of magnitude more powerful and that combined still have 16GB of VRAM, Gemma3:27B in Ollama is unusable.
After doing tons of reading, the unanimous answer for multiple GPUs is vLLM, but it has an overwhelming number of knobs and switches, and trying to dive in headfirst with two mismatched GPUs is not easy.
I legitimately hate being “that guy”, but would you or anyone else be able to help with some sane settings that let me use these cards effectively in parallel?