Deploying a large LLM when GPU VRAM is not enough

In a Kubernetes environment, we want to deploy a very large LLM with over 200B parameters; it may need over 500GB of VRAM. Our node has only 192GB of VRAM available, across 4 GPUs in the same node. Can we use CPU offloading and utilize the node's RAM to deploy the model, sharing it between RAM and VRAM? What parameters can be passed in the Kubernetes manifest to achieve this? Of course, inference performance will be very low.

Yes, you can use CPU offloading in vLLM to deploy models larger than your available GPU VRAM by utilizing system RAM. The key parameter is --cpu-offload-gb, which specifies how many GB of CPU RAM to use for offloading model weights; note that the value is interpreted per GPU, so with tensor parallelism the total CPU RAM used can be up to this value times the number of GPUs. For example, with 192GB of VRAM and 4 GPUs, you can set --tensor-parallel-size 4 and add --cpu-offload-gb to increase the effective memory available to each GPU. Bear in mind that this requires a fast CPU-GPU interconnect and will significantly reduce inference throughput and increase latency, because weights are streamed between CPU and GPU during inference. Example manifest args:

args:
  - "--model"
  - "your-model-name"
  - "--tensor-parallel-size"
  - "4"
  - "--cpu-offload-gb"
  - "400"

You should adjust --cpu-offload-gb to the amount of RAM per GPU you want to allow for offloading. For more details, see the official documentation and related discussions.
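
One Kubernetes-specific caveat: the CPU RAM that vLLM pins for offloaded weights counts against the pod's memory limit, so the pod spec should request enough memory to cover the total offload plus headroom, otherwise the kubelet can OOMKill the pod while the weights are loading. A minimal sketch of the resources stanza, with illustrative numbers matching the example above (4 GPUs x ~100GB offloaded each):

resources:
  requests:
    nvidia.com/gpu: "4"
    memory: "450Gi"     # >= total offloaded weights (~400GB) plus headroom
  limits:
    nvidia.com/gpu: "4"
    memory: "500Gi"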

Would you like a full Kubernetes YAML example or more details on performance implications?

Can I use AWQ instead? In that case, do I need to pass any argument in the Kubernetes manifest?

Yes, you can use AWQ quantized models with vLLM to reduce memory usage. In your Kubernetes manifest, you need to add the argument --quantization awq when launching vLLM with an AWQ model. For example:

args:
  - "--model"
  - "your-awq-model"
  - "--quantization"
  - "awq"

This tells vLLM to load and run the model in AWQ quantized mode. No other special arguments are required for AWQ, but you can combine this with --cpu-offload-gb if you still need to offload to CPU RAM. See AutoAWQ documentation for more details.

Would you like more detail or a full manifest example?

For vLLM pods with enough VRAM for FP16 or AWQ, what is the recommendation for RAM and CPU in the pod resources? Is there any relation between the number of parameters and the total VRAM used (e.g. 196GB VRAM)?

It seems likely that for vLLM pods running in FP16 or AWQ mode (without CPU offloading), the main resource requirement is GPU VRAM, which should be large enough to fit the model weights and KV cache. The RAM and CPU requests in the pod spec are typically much lower—often 16–32GB RAM and 4–8 CPU cores are sufficient for serving, unless you have very high concurrency or multi-modal workloads. There is no strict formula, but RAM/CPU are not directly proportional to model parameter count; VRAM is the main constraint for model size, as model weights and cache must fit there for FP16/AWQ inference. For a 196GB VRAM node, you can generally run a model whose weights and cache fit within that VRAM, regardless of system RAM.
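
As a rule of thumb for the VRAM side of that question: FP16 weights take about 2 bytes per parameter and 4-bit AWQ weights roughly 0.5-0.6 bytes per parameter, plus room for the KV cache, activations and CUDA graphs. So a 200B-parameter model is on the order of 400GB in FP16 but only ~100-120GB in AWQ, which is why a model of that size fits in ~196GB of VRAM only when quantized. A minimal sketch of a pod resources block along the lines above (the numbers are illustrative, not a vLLM requirement):

resources:
  requests:
    nvidia.com/gpu: "4"
    cpu: "8"            # serving is GPU-bound; 4-8 cores is usually enough
    memory: "32Gi"      # headroom for model loading, tokenizer and host buffers
  limits:
    nvidia.com/gpu: "4"
    cpu: "16"
    memory: "64Gi"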

Would you like more detail on sizing this for a specific model?

When the GPU VRAM is not enough, could we use AWQ quantization with CPU offload when deploying vLLM in Kubernetes? Do we add the maximum RAM in GB as an argument?

Yes, you can use AWQ quantization with CPU offload in vLLM to deploy large models when GPU VRAM is insufficient. In your Kubernetes manifest, add both --quantization awq and --cpu-offload-gb <RAM_GB> arguments to specify AWQ mode and the amount of system RAM to use for offloading model weights. For example:

args:
  - "--model"
  - "your-awq-model"
  - "--quantization"
  - "awq"
  - "--cpu-offload-gb"
  - "256"

Set <RAM_GB> to the maximum RAM in GB (per GPU) that you want vLLM to use for offloading. This lets vLLM split the model weights between VRAM and RAM, but performance will be lower due to CPU-GPU transfers. See the AutoAWQ documentation and CPU offload usage.

Would you like more detail or a full manifest example?

In this example, is the 256GB offload the maximum allocated? We would like vLLM to use VRAM first and then take only whatever is needed from RAM, maybe not the complete 256GB.

Partly. --cpu-offload-gb 256 is an upper bound on the CPU RAM used for offloaded weights (per GPU when tensor parallelism is enabled), and RAM is only consumed for the weights actually offloaded, so the full 256GB is not reserved up front. However, vLLM does not strictly fill VRAM first and spill over only what does not fit: during model loading it offloads weights to CPU up to the configured amount regardless of free VRAM. To keep as much as possible in VRAM, set the value close to the actual shortfall, roughly (weight size per GPU) minus (VRAM per GPU available for weights), rather than an arbitrarily large number.

Would you like more detail or a usage example?

I have implemented this:
args:
  - "--model"
  - "QuantTrio/Qwen3-Coder-480B-A35B-Instruct-AWQ"
  - "--tensor-parallel-size"
  - "4"
  - "--max-model-len"    # Reduce for testing/memory efficiency
  - "4096"               # Start with a more manageable sequence length, e.g., 4096 or 8192
  - "--quantization"
  - "awq_marlin"
  - "--cpu-offload-gb"
  - "512"
  - "--trust-remote-code"

but I still get error torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 600.00 MiB. GPU 0 has a total capacity of 44.40 GiB of which 361.00 MiB is free. Including non-PyTorch memory, this process has 42.76 GiB memory in use. Of the allocated memory 42.03 GiB is allocated by PyTorch, and 145.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation

Even with AWQ quantization and --cpu-offload-gb, vLLM still requires enough VRAM to hold part of the model and the KV cache. If the model and cache together exceed available VRAM, you will get CUDA OOM errors. The --cpu-offload-gb only offloads model weights, not the KV cache, which must fit in VRAM. For extremely large models like Qwen3-Coder-480B, even with quantization and offload, 44GB VRAM per GPU may not be enough, especially with a sequence length of 4096 and 4-way tensor parallelism. See AutoAWQ documentation and CPU offload usage.
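
For scale, a rough back-of-envelope for this model (assuming ~0.5 bytes per parameter for the 4-bit AWQ weights and ignoring scales and any unquantized layers):

480B params x ~0.5 bytes/param   ≈ 240GB of weights
240GB / 4 GPUs (TP=4)            ≈ 60GB of weights per GPU shard
VRAM per GPU                     ≈ 44GB

So each GPU's weight shard alone exceeds its local VRAM. With --cpu-offload-gb most of those weights can live in CPU RAM (the value is per GPU, so 512 with 4 GPUs allows far more offload than the shard size), but the KV cache, activations and temporary buffers still have to fit entirely in the 44GB per GPU.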

Would you like more detail on how to further reduce memory usage or debug this?

How can I reduce memory usage in this case? Which parameters should I modify? It will be for 1 user and 1 batch. Should I also set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True?

To reduce memory usage for a single user and batch, lower --max-model-len (e.g., 1024 or 2048), set --max-num-seqs 1, and reduce --gpu-memory-utilization (e.g., 0.7). These changes shrink the KV cache and overall VRAM usage. Also, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can help avoid fragmentation and may prevent OOM errors, as suggested in multiple vLLM issues.

Example args:

args:
  - "--max-model-len"
  - "1024"
  - "--max-num-seqs"
  - "1"
  - "--gpu-memory-utilization"
  - "0.7"

And set the environment variable in your pod:

env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"

Would you like more detail on any of these settings?

What is the default value of --max-num-seqs? If I increase it to 5, will it have a huge impact on the required memory?

The default value of --max-num-seqs is 1024 in the vLLM V1 engine and 256 in the older V0 engine and some configurations. It controls how many sequences can be scheduled concurrently: the KV-cache pool itself is sized from --gpu-memory-utilization after the weights are loaded, so raising the limit from 1 to 5 mainly increases how much of that pool can be in use at once and the peak activation memory during batching. For a single-user workload the difference between 1 and 5 is modest, and both are far below the defaults of 256 or 1024. See Optimization and Tuning and Conserving Memory.

Would you like more detail on how to estimate the exact memory usage?

The LLMs deployed with vLLM (quantized or not), do they have tool calling enabled by default, or do we need to pass some parameters in the Kubernetes manifest for this?

Tool calling is not enabled by default in vLLM. You must explicitly enable it by passing --enable-auto-tool-choice and --tool-call-parser <parser_name> as arguments in your Kubernetes manifest. The parser name depends on your model (e.g., hermes, llama, mistral, etc.). See Tool Calling documentation for details.

Would you like a manifest example or more information on supported parsers?

For Qwen models, do I just pass those 2 parameters, --tool-call-parser hermes and --enable-auto-tool-choice?

Yes, for Qwen models, you just need to pass --enable-auto-tool-choice and --tool-call-parser hermes to enable tool calling. The Qwen tokenizer config already supports Hermes-style tool use, so no extra chat template is required. See Qwen Models Tool Calling.
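
For example, following the same args-list style as above (the model name is just a placeholder):

args:
  - "--model"
  - "your-qwen-model"
  - "--enable-auto-tool-choice"
  - "--tool-call-parser"
  - "hermes"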

Would you like more detail or a manifest example?
