I have three NVIDIA CUDA-capable GPUs, and that is where the good-sounding part ends.
First, an RTX 5070 Ti with its mere 16 GB of VRAM, though it's at least GDDR7 on a 256-bit memory bus. (Yeah, oops - the 5090 with 32 GB of VRAM and a 512-bit bus was obviously the way to go, and now I know that.)
I also have an RTX 3060 with 12 GB of GDDR6 VRAM on a 192-bit memory bus.
And I have an ancient GTX 1650 with 6 GB of GDDR6 VRAM on a 192-bit memory bus.
If I slap all three into a machine that can run three cards, first question: is that supported by vLLM without too much trouble?
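Concretely, here's the kind of launch I have in mind (just a sketch; the model name is a placeholder, and I haven't confirmed that mixed-generation cards can share one tensor-parallel group, or that a TP size of 3 is even legal for most models, since it usually has to divide the attention head count):

```python
# Sketch of a multi-GPU vLLM launch (model name is a placeholder, numbers are guesses)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,       # e.g. just the 5070 Ti + 3060; 3 often won't divide
                                  # the model's attention head count evenly
    gpu_memory_utilization=0.90,  # fraction of each card's VRAM vLLM is allowed to claim
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```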
Second, would I even want to do that? Is 34 GB of VRAM, but limited to roughly GDDR6 speeds and a 192-bit bus, a better use of my resources than running the 5070 Ti alone and offloading whatever doesn't fit off the GPU? (Sketch of what I mean by offloading right after this.)
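By "offload" I mean something like this (again a sketch; I believe vLLM exposes a cpu_offload_gb engine arg for pushing part of the weights into system RAM, and the numbers here are made up):

```python
# Single-GPU route: keep the 5070 Ti, spill part of the model into system RAM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,  # let vLLM claim most of the 16 GB of VRAM
    cpu_offload_gb=8,             # park ~8 GB of weights in CPU RAM (made-up number);
                                  # that traffic crosses PCIe, which is the slow part
)
```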
If vLLM will do "3 GPUs" with relatively little hassle, will it make the best use of them? Like, maybe the extra VRAM is worth using to get to 34 GB, while the older generations of CUDA cores, tensor cores, ray-tracing cores, etc. (apologies if those overlap) mostly get ignored?
Memory bus speed and PCIe generation (the generation's bandwidth, specifically) probably have a lot to do with the "is it worth adding my older GPUs into the mix" answer, assuming the answer to "does vLLM do that without too much hassle" is yes, and I think it is. So really I'm wondering whether it's worth adding the two older GPUs alongside the 5070 Ti, primarily for the (maybe) benefit of 34 GB of VRAM versus 16 GB that's faster.
With just 16 GB of VRAM, the computer is doing the AI version of the paging-to-disk that old computers did when there wasn't enough plain old RAM. It's apparently even still called "swapping" (I deduce, based on vLLM's swap-space engine param), so I bet it's also still extremely slow relative to having enough VRAM, just like it was when you ran out of regular RAM.
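The param I'm referring to looks something like this (a sketch; as I understand it, swap_space is measured in GiB of CPU RAM per GPU and holds swapped-out KV-cache blocks, and the value here is just illustrative):

```python
# The swap-space knob in question (placeholder model; number is illustrative)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    swap_space=8,  # GiB of plain CPU RAM per GPU reserved for swapped KV-cache blocks
)
```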
Hmm, I guess my question really comes down to this: when a swap-space value is set and another GPU has VRAM available, will vLLM automatically pick that VRAM before plain old CPU RAM? GDDR6 VRAM has got to be a better place to spill to than DDR4 system RAM, and since the bus/PCIe speed comes into play either way, it shouldn't change the comparison.
TIA!
EDIT: Or perhaps the bus/PCIe speed is the limiting factor either way, rendering moot any choice between swapping to "an additional GPU's VRAM vs. CPU RAM."