For 70B-parameter models, neither 32GB nor 48GB of VRAM is enough to run the unquantized (FP16/BF16) weights: at 2 bytes per parameter the weights alone take about 140GB, so you'd need at least 2x80GB A100s or similar, before counting the KV cache. Quantized versions are a different story. A 4-bit 70B model has roughly 35GB of weights, which does not fit in 32GB without offloading or more aggressive (sub-4-bit) quantization, but does fit in 48GB with some room left for the KV cache and activations. So the jump from 32GB to 48GB helps only if the quantized weights plus KV cache fit within 48GB; otherwise you are still limited to smaller models, lower-bit quantization, or offloading. For unquantized 70B models, 48GB changes nothing, and even 96GB falls short of the ~140GB the weights need; 96GB would only be enough for roughly 8-bit weights (~70GB). See this discussion and here.
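The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the layer count, KV-head count, and head dimension are assumed Llama-style 70B values (80 layers, 8 KV heads with grouped-query attention, head dimension 128), the 2GB overhead is a placeholder, and the helper names are made up for illustration. Real quantized checkpoints also carry extra metadata (scales, zero points), so actual sizes run somewhat higher.

```python
# Back-of-the-envelope VRAM estimate for a dense transformer.
# Sizes are in decimal GB (1e9 bytes) to match the figures in the text.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, ignoring quantization metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache for `tokens` cached tokens in total (across all sequences).

    Defaults are ASSUMED Llama-style 70B values with a 16-bit cache.
    The factor 2 accounts for storing both keys and values.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * tokens / 1e9

def fits(vram_gb: float, params_billion: float, bits_per_weight: float,
         tokens: int, overhead_gb: float = 2.0) -> bool:
    """True if weights + KV cache + an assumed fixed overhead fit in VRAM."""
    needed = (weight_gb(params_billion, bits_per_weight)
              + kv_cache_gb(tokens) + overhead_gb)
    return needed <= vram_gb

if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"70B at {bits}-bit: ~{weight_gb(70, bits):.0f} GB of weights")
    print(f"KV cache for 8192 tokens: ~{kv_cache_gb(8192):.2f} GB")
    for vram in (32, 48, 96, 160):
        print(f"{vram} GB VRAM, 4-bit, 8192 tokens: fits={fits(vram, 70, 4, 8192)}")
```

Under these assumptions the 4-bit model needs about 35GB for weights plus about 2.7GB of KV cache for 8192 tokens, which is over 32GB but comfortably under 48GB.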
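Regarding /dev/shm: this is shared memory used for communication between processes (for example, the worker processes in tensor-parallel inference), which is why vLLM's Docker instructions recommend --ipc=host or a larger --shm-size. It is separate from --swap-space, which reserves CPU RAM per GPU for KV-cache blocks swapped out of VRAM. Increasing VRAM on a single GPU does not by itself require a larger /dev/shm; it matters mainly if you also add GPUs or worker processes, or run multiple large models in containers with a small default /dev/shm. See this issue for more details.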
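If you want to see how much shared memory is actually available before changing anything, a quick check like the following can help. It is a minimal sketch using only the Python standard library; the function name `shm_report` is made up for illustration, and it assumes a Linux system (or container) where /dev/shm is mounted as a tmpfs.

```python
import os
import shutil

def shm_report(path: str = "/dev/shm"):
    """Return total/used/free space for a mount point in GB, or None if absent.

    Inside Docker the default /dev/shm is often only 64MB unless
    --shm-size or --ipc=host is set.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return {
        "total_gb": usage.total / 1e9,
        "used_gb": usage.used / 1e9,
        "free_gb": usage.free / 1e9,
    }

if __name__ == "__main__":
    report = shm_report()
    if report is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm: {report['total_gb']:.2f} GB total, "
              f"{report['free_gb']:.2f} GB free")
```

Run it inside the same container that will host vLLM, since the host and the container can report very different sizes.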
Would you like a step-by-step breakdown or code/config examples for quantized model deployment?
Sources: