Making best use of varying GPU generations

I have three NVIDIA CUDA-capable GPUs, and that is where the good-sounding part ends.

First, I have an RTX 5070 Ti with its mere 16 GB of VRAM, but it's at least GDDR7 on a 256-bit memory bus. (Yeah, oops - the 5090 with 32 GB of VRAM and a 512-bit bus was obviously the way to go, and now I know that.)

I also have an RTX 3060 with 12 GB of GDDR6 VRAM on a 192-bit memory bus.

And I have an ancient GTX 1650 with 6 GB of GDDR6 VRAM on a 192-bit memory bus.

If I slap all three into a computer that can run three cards, first: is that supported by vLLM without too much trouble?
Second, would I even want to do that? Is 34 GB of VRAM, but limited to ~GDDR6 speed and a 192-bit bus, a better use of my resources than using the 5070 Ti alone and working to offload memory from the GPU?
If vLLM will do "3 GPUs" with relatively little hassle, will it make the best use of them? Like, maybe the extra VRAM is worth using to get to 34 GB, but ignore the older generations of CUDA cores, tensor cores, ray tracing cores, etc. (apologies if those overlap)?
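
For concreteness, here's the kind of launch I'm imagining (a sketch, not a recommendation: the model name is just a placeholder, the device ordering is a guess, and I'm using the offline LLM API rather than `vllm serve`):

```python
# Sketch of "use 2 of my 3 GPUs" with vLLM's offline API.
# Model name is a placeholder; device indices are an assumption (check nvidia-smi).
import os

# Pick which cards vLLM sees before importing it (e.g. 0 = 5070 Ti, 1 = 3060).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,        # split the model across the 2 visible GPUs
    gpu_memory_utilization=0.90,   # fraction of each GPU's VRAM vLLM may claim
)

out = llm.generate(["Hello, multi-GPU world!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

One wrinkle I've seen mentioned (hedging here): tensor parallelism splits the model evenly across the cards, so the smallest card's VRAM tends to set the per-GPU ceiling, which is part of why 16 + 12 + 6 doesn't simply add up to a usable 34 GB.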

Memory bus speed and PCIe generation (that generation's speed, specifically) probably have a whole lot to do with the "is it worth it to add my older GPUs into the mix" answer, assuming the answer to "does vLLM do that without too much hassle" is yes - and I think it is. So really I'm wondering whether it's worth adding the two older GPUs alongside the 5070 Ti for, primarily, the (maybe) benefit of 34 GB of VRAM vs. 16-but-faster.

With just 16 GB of VRAM, the computer is doing the AI version of the paging-to-disk that old computers did when there wasn't enough plain ole RAM. It's even still called "swapping" (I deduce, based on vLLM's swap-space engine param), so I bet it's also still extremely slow relative to having enough VRAM, just like it was when lacking regular RAM.

Hmm, I guess my question really comes down to this: when setting a swap-space value and another GPU has VRAM available, will vLLM automatically pick that VRAM before plain old CPU RAM? GDDR6 VRAM has got to be a better place to go than DDR4 system RAM, and the bus speed comes into play either way, so that part is a wash.
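
For concreteness, this is the knob I mean (the size and model name are made-up placeholders):

```python
# Sketch of the swap-space engine arg I'm asking about (values are examples only).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,
    swap_space=8,  # GiB of CPU RAM vLLM may use as swap space for preempted KV cache
)
```

As far as I can tell from the docs, that parameter is sized in GiB of CPU RAM, which is exactly why I'm asking whether spare VRAM on another card can play that role instead.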

TIA!

EDIT: Or perhaps bus speed is the limiting factor, rendering moot any choice between swapping to "additional GPU VRAM" vs. CPU RAM.

More succinctly: when adding additional GPUs within one machine, are they worked into the mix as ~black boxes that you give inputs to, and each one churns however its innards churn and then gives you output?

Or are additional GPUs added ~piecemeal, as additional physical resources within an effectively virtualized single GPU?

My guess is that additional GPUs are added as black boxes, each handling its own stuff as if it were the only GPU, with something outside the GPUs orchestrating (K8s maybe, possibly, from what little I've read?).
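
From the reading I've done since posting, vLLM seems to expose both flavors, which maps loosely onto my two guesses: tensor parallelism slices every layer across the cards (the "virtualized GPU" picture), while pipeline parallelism gives each card a contiguous chunk of layers (closer to the "black box per GPU" picture). A sketch of the two knobs, unverified on my mix of cards and with a placeholder model name:

```python
# Two ways vLLM can spread one model across multiple GPUs (my understanding; unverified on my cards).
from vllm import LLM

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model name

# Tensor parallelism: every layer's weights are sharded across the visible GPUs.
llm = LLM(model=MODEL, tensor_parallel_size=2)

# Pipeline parallelism: each GPU owns a contiguous chunk of layers instead.
# (Support and constraints vary by vLLM version; shown here only for contrast.)
# llm = LLM(model=MODEL, pipeline_parallel_size=2)
```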

I bumped into NVIDIA NCCL this morning while trying to build PyTorch from source, including a run of nightly.py. The last thing I see from the build is "checking out nccl release tag: v2.26.2-1", which led me to look up NCCL. [Edit: NCCL doesn't seem to be related to my build failure, and my build failure is entirely OT, so never mind that.]

Lo and behold, NVIDIA themselves provide the "additional GPUs within one machine" handler. I'm sure that assumes all the GPUs are NVIDIA, but my three are. I'll look into NCCL to figure out more about how resources are used across GPUs of varying capability.
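
Here's a little sanity-check snippet (assuming a CUDA-enabled PyTorch build with NCCL bundled, i.e. Linux) to see what PyTorch reports for each card and which NCCL version it ships with; nothing in it is vLLM-specific:

```python
# Quick inventory of what PyTorch sees for each card, plus its bundled NCCL version.
# Assumes a CUDA-enabled PyTorch build with NCCL (typically Linux).
import torch
import torch.cuda.nccl as nccl
import torch.distributed as dist

print("NCCL available:", dist.is_nccl_available())
print("NCCL version:  ", nccl.version())

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, "
          f"{props.total_memory / 1024**3:.1f} GiB, "
          f"compute capability {props.major}.{props.minor}")
```

As I understand it, NCCL is the collective-communication layer (all-reduce and friends) that PyTorch and vLLM use to shuffle data between GPUs; it's not something you'd normally configure directly for a basic setup.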

And then in yesterday's vLLM office hours I heard mention of distributed processing across multiple machines, which is, IIUC, present (and/or greatly enhanced) in vLLM v1.

Just FYI for any other n00b who is DuckDuckGoing around someday for something about multiple GPUs in one machine, or across 2+ machines, and lands here.

If any currently-knowledgeable folks read something misunderstood (by me), misleading or just flat wrong here, please chime in for me and future readers.