How to enable expert offloading when using a MoE LLM for inference?
vLLM requires all experts to be loaded at initialization, just like any other model weights. If the full set of experts does not fit on your GPU(s), you can use the cpu_offload_gb parameter to offload a specified number of GiB of weights to CPU memory; the offloaded weights are copied back to the GPU on each forward pass. Note that this approach is slow, and there is no support for loading only the experts the router activates: all experts must be present in memory (GPU or CPU) at startup. On-demand or JIT expert loading is not currently supported in vLLM, according to the available documentation and issues.
For more details or code examples, let me know.
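A minimal sketch of what enabling this looks like with vLLM's offline API. The model name is only an illustrative MoE checkpoint, and the 10 GiB value is an arbitrary example; choose a value large enough to cover the weights that spill past GPU memory. The import is guarded so the snippet documents the intended call even where vLLM is not installed:

```python
# Sketch: enabling weight offloading via vLLM's cpu_offload_gb engine argument.
# The model name below is an example MoE checkpoint, not a recommendation.
engine_args = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "cpu_offload_gb": 10,  # GiB of weights kept in CPU RAM, per GPU
}

try:
    from vllm import LLM
    # Requires a GPU plus enough CPU RAM to hold the offloaded 10 GiB.
    llm = LLM(**engine_args)
except ImportError:
    llm = None  # vLLM not installed; engine_args above shows the intended call
```

Expect noticeably lower throughput than a fully GPU-resident model, since the offloaded weights cross the PCIe bus on every forward pass.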
Hint: Mention @RunLLM in the post for followups.
@RunLLM As you said “vLLM requires that all experts are loaded at initialization, just like any other weight”, what data is actually exchanged when the cpu_offload_gb parameter is enabled?
When cpu_offload_gb is enabled, all model weights, including every MoE expert, are first loaded into CPU memory, and the specified amount stays offloaded there rather than resident on the GPU. During each forward pass, the offloaded weights (including expert weights) are copied from CPU to GPU for computation, then released from GPU memory afterwards. There is no on-demand or partial expert loading: all expert weights must be present in either GPU or CPU memory at startup. The data exchanged is the model weights themselves, not activations or intermediate results.
Would you like a step-by-step breakdown or code example?
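As a step-by-step illustration, here is a toy simulation (plain Python, not vLLM internals) of the exchange described above: offloaded weight tensors live in a CPU-side store, every one of them is copied to the GPU-side store for each forward pass regardless of which experts the router would activate, and they are released afterwards. All names here are hypothetical:

```python
# Toy model of cpu_offload_gb behavior: weights move, not activations.
cpu_store = {f"expert_{i}.weight": [0.1] * 4 for i in range(8)}  # 8 experts on CPU
gpu_store = {}   # transient GPU residency during a forward pass
transfers = []   # record of CPU -> GPU copies

def forward(tokens):
    # Every offloaded weight is staged onto the GPU for this pass,
    # even though the router would only activate a subset of experts.
    for name, tensor in cpu_store.items():
        gpu_store[name] = list(tensor)  # simulated CPU -> GPU copy
        transfers.append(name)
    out = sum(len(t) for t in gpu_store.values())  # stand-in for computation
    gpu_store.clear()  # weights released from GPU memory after the pass
    return out

result = forward(["hello"])
```

The point of the sketch: the per-pass traffic is proportional to the offloaded weight size, not to the number of experts actually routed to, which is why this mode is slow.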
Sources:
- vLLM requires that all experts are loaded at initialization, just like any other weight. If you don’t have enough space to fit all your weights, you can use the cpu_offload_gb parameter to choose how many GBs to offload to CPU memory to be swapped onto GPU for each inference. Note that this is slow
- cpu_offload_gb: The space in GiB to offload to CPU, per GPU. … as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass.