32GB vs 48GB VRAM

I get the feeling that today’s small and medium models (roughly 17B parameters and below) are built and default-configured to run within 32GB of VRAM.

Moving up to larger models around 70B parameters, would 48GB of VRAM instead of 32GB have any practical effect on making my life easier? Discussion of 70B models seems to swirl around 96GB of VRAM and above.

Basically, I have one 16GB GPU and I’m going to add a second 16GB GPU (albeit with slower memory bandwidth: 448 GB/s vs my existing card’s 896 GB/s) to reach the seemingly expected 32GB of VRAM.

I’m wondering whether it’s worth adding a third 16GB GPU for 48GB of VRAM, or whether the next model size up from what runs in 32GB really needs more like 128GB of VRAM, so that a bump from 32 to 48 would bring no real benefit.

My suspicion is that 48GB of VRAM is, practically speaking, not a real, usable improvement over 32GB. I think it’s either 32GB or >=128GB.

The next step up from 32GB might be 128GB, or maybe 96GB, but either way it’s probably much more than 48GB, such that 48GB is, in practice, effectively still just 32GB.

So I’m leaning toward “add one GPU for 32GB of VRAM; don’t bother going for 48GB.” Am I wrong in this thinking, and in my reasoning for why 48GB is not really better than 32GB in practice?

On another topic, there’s a question about /dev/shm and model sizes. If I go from 16GB to 32GB (or 48GB) of VRAM, do I need to do anything to increase the /dev/shm size? My thinking is that /dev/shm is shared CPU memory and shouldn’t have any effect on model serving, because model serving is just one CPU app… I think…

thanks!

For 70B-parameter models, 32GB or even 48GB of VRAM is not enough to run full-precision (FP16/BF16) weights; those typically require 140GB+ of VRAM, so you’d need at least 2x 80GB A100s or similar for standard weights. Quantized versions (e.g., 4-bit) can fit in the 32–48GB range, but the jump from 32GB to 48GB only helps if the quantized model’s weights plus KV cache actually fit within it; otherwise you’re still limited to smaller models, more aggressive quantization, or offloading. In practice, 48GB is only a modest improvement over 32GB for large models, and the next practical step for full-precision 70B models is 96GB+ of VRAM, not 48GB.
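To put rough numbers on that, here is a back-of-the-envelope sketch. The layer/head counts are Llama-2-70B-style assumptions (80 layers, 8 KV heads with GQA, head_dim 128), not measurements from any particular setup:

```python
# Back-of-the-envelope VRAM math for a 70B model (illustrative only;
# real usage varies with the model, quantization format, and runtime overhead).

def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB for `tokens` of context (keys + values)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

fp16_weights = weights_gb(70, 16)             # ~140 GB -> multi-GPU 80GB-class territory
int4_weights = weights_gb(70, 4.5)            # ~39 GB (4-bit weights plus quantization overhead)
kv_8k        = kv_cache_gb(80, 8, 128, 8192)  # ~2.7 GB per 8k-token sequence at FP16

print(f"FP16 weights : {fp16_weights:6.1f} GB")
print(f"4-bit weights: {int4_weights:6.1f} GB")
print(f"KV cache, 8k : {kv_8k:6.1f} GB")
```

So a 4-bit 70B plus cache and runtime overhead lands in roughly the 40–48GB range: over the 32GB ceiling, tight but possible at 48GB, and nowhere near enough for FP16, which is where the 96GB+/140GB+ figures come from.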

Regarding /dev/shm: increasing VRAM does not by itself require increasing /dev/shm. vLLM’s --swap-space option allocates ordinary CPU RAM (a per-GPU amount, in GiB) for KV-cache swapping, which is separate from /dev/shm. Where /dev/shm does tend to matter is multi-GPU tensor-parallel serving inside a container, because the worker processes share data through shared memory; in that case you may need to launch Docker with a larger --shm-size (or --ipc=host), since the container default is only 64MB. On bare metal, /dev/shm is typically sized at half of system RAM and rarely needs tuning.
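If it helps, here is a minimal sketch of where those knobs live in vLLM’s offline Python API. The model name is just a placeholder for any 4-bit (AWQ) checkpoint, and note that a quantized 70B needs roughly 40GB+ of aggregate VRAM, so on 2x16GB you’d point this at a smaller quantized model:

```python
# Minimal vLLM sketch (illustrative; the model name is a placeholder, and a
# 4-bit 70B needs ~40 GB+ of total VRAM, so use a smaller model on 2x16 GB).
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/model-70b-awq",   # placeholder: any AWQ-quantized checkpoint
    quantization="awq",               # load 4-bit AWQ weights
    tensor_parallel_size=2,           # shard the model across the two GPUs
    gpu_memory_utilization=0.90,      # fraction of each GPU's VRAM vLLM may claim
    swap_space=4,                     # GiB of ordinary CPU RAM per GPU for KV-cache swapping
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

Run it with both cards visible (e.g. CUDA_VISIBLE_DEVICES=0,1) and vLLM handles the sharding; nothing here touches /dev/shm unless you wrap it in a container as described above.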

Would you like a step-by-step breakdown or code/config examples for quantized model deployment?
