I attended the 8 April 2025 vLLM office hours. A lot of the talk made it sound like the topics covered effectively supplant the need for v0's `--cpu-offload-gb` param.
Given my 16 GB VRAM Nvidia RTX 5070 Ti and my dummy/n00b status, I got two things: the idea of quantization, and the idea that I should retrieve pre-optimized models from RedHat on Hugging Face.
Beyond those two takeaways that I actually comprehended as a n00b/dummy, I mostly felt like I was just listening to the words and phrases (vLLingo, lol) so I can get accustomed to hearing and reading them without being scared by them.
Since a lot of the talk was over my head (for now), but I seriously need an alternative to `--cpu-offload-gb` RIGHT NOW: is there an office-hours recap I can review? And if not, can the main points be posted here for how those of us with VRAM envy can make do with our tiny VRAMs?
I ask because I'd bet those of us with tiny VRAMs overlap 99.9% with the dummy-n00bs on the vLLM-user Venn diagram.
For example: I'm pretty sure I launched Phi-4 once and it worked locally, but since that one time I haven't been able to get it to start at all. The KV cache at the full context length uses up my whole 16 GB of VRAM. I can get that down to about 4 GB, but with 12 GB already taken (4 of those 12 only "reserved", not used, yet also not available), I need another 1 GB of VRAM just to get past this "not enough memory" error, and probably on to the next one.
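For reference, here's my back-of-envelope math for why it doesn't fit. The layer/head counts are my reading of the Phi-4 config, so treat them as approximate, not gospel:

```python
# Back-of-envelope VRAM math for a Phi-4-sized model (~14B params).
# Shape numbers are my guesses at the config: 40 layers, 10 KV heads,
# head_dim 128, 16k max context. Approximate on purpose.
GIB = 2**30

params = 14e9
layers, kv_heads, head_dim, max_len = 40, 10, 128, 16384

# Weights in fp16/bf16: 2 bytes per parameter.
weights_fp16_gib = params * 2 / GIB  # already more than the whole card

# KV cache: 2 tensors (K and V) per layer, 2 bytes each in fp16.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2
kv_full_ctx_gib = kv_bytes_per_token * max_len / GIB

# Same weights at 8 bits (FP8/INT8): 1 byte per parameter.
weights_8bit_gib = params / GIB  # tight on 16 GiB, but at least plausible

print(f"fp16 weights:  {weights_fp16_gib:.1f} GiB")
print(f"KV cache @16k: {kv_full_ctx_gib:.3f} GiB")
print(f"8-bit weights: {weights_8bit_gib:.1f} GiB")
```

If that math is roughly right, the unquantized weights alone blow past 16 GiB before the KV cache even enters the picture, which I assume is why the quantized builds were the headline takeaway for people like me.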
So, not complaining!!! But can we get a recap/overview of the office hours (or point me at one) covering how to work around limited GPU memory? Yes, the OH was about terabyte-sized models and sixteen $1.5-million GPUs, but the memory-management theory, and even the params, ought to be the same (except for distribution across GPUs and/or additional machines) for working with 16 GB of VRAM and a 12+B-param model. (With Phi-4 the model size isn't that big, but other things sure seem to be; and let me tell you, in my experience a significantly shorter context length makes for a very, very dumb Phi-4. That's the HF/MS build, not HF/RedHat.)
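In case it helps anyone answer: these are the knobs I *think* apply at my scale, pulled from the vLLM engine-args docs. The values (and the placeholder model) are my guesses, not recommendations, so please correct me:

```shell
# Knobs I believe matter for squeezing into 16 GB (values are my guesses):
#   --max-model-len           cap context length so the KV cache shrinks
#   --gpu-memory-utilization  leave headroom for the desktop/driver
#   --kv-cache-dtype fp8      roughly halve the KV cache vs fp16, if supported
#   --max-num-seqs            fewer concurrent sequences in flight
#   --enforce-eager           skip CUDA graphs, saving some extra VRAM
vllm serve <some-quantized-model> \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 8 \
  --enforce-eager
```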
Thanks for any advice!
EDIT: I can re-watch the office-hours video, but it will again just be words I'm working toward learning to hear without them inducing anxiety. I'm more hoping for something like a hyperlink-annotated recap: the main points, very briefly what each one is about, and hyperlink(s) to get myself lost in.