Enable Expert Offloading

How do I enable expert offloading when running inference with an MoE LLM?

vLLM requires all experts to be loaded at initialization, just like any other model weights. If the full model does not fit on your GPU(s), you can use the cpu_offload_gb engine argument to keep a specified number of GiB of weights in CPU memory; those weights are copied onto the GPU for each forward pass. Note that this approach is slow, and there is no support for loading only the experts selected by the gating network: all experts must be present in memory (GPU or CPU) at startup. On-demand or JIT expert loading is not currently supported in vLLM, according to the available documentation and issues.
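A minimal usage sketch: cpu_offload_gb is the documented engine argument, but the model name and the 16 GiB offload size below are example values, not recommendations. The import is guarded so the sketch reads standalone even without vLLM installed.

```python
# Sketch of enabling CPU offloading in vLLM. Model name and the
# 16 GiB offload size are example values only.
try:
    from vllm import LLM, SamplingParams
except ImportError:  # allow reading the sketch without vLLM installed
    LLM = SamplingParams = None

if LLM is not None:
    # Keep 16 GiB of weights in CPU RAM; vLLM copies them onto the GPU
    # as needed during each forward pass.
    llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1",
              cpu_offload_gb=16)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)
else:
    print("vllm is not installed; the call shape above is the point")
```

The equivalent server-side flag is `--cpu-offload-gb` when launching vLLM from the command line.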

For more details or code examples, let me know.

Hint: Mention @RunLLM in the post for followups.

@RunLLM As you said, "vLLM requires that all experts are loaded at initialization, just like any other weight" — what data is exchanged when the cpu_offload_gb parameter is enabled?

When cpu_offload_gb is enabled, all model weights, including every MoE expert, are loaded at startup, with the specified number of GiB kept in CPU memory instead of GPU memory. During inference, the offloaded weights (including expert weights) are transferred from CPU to GPU on the fly for computation, then released from GPU memory after use. There is no on-demand or partial expert loading: all expert weights must reside in either GPU or CPU memory at startup, and the data exchanged is the model weights themselves, not activations or intermediate results.
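To make the data flow concrete, here is a toy pure-Python model of the swap pattern described above (all names and sizes are hypothetical; this is not vLLM code): every weight starts resident in a CPU store, each layer's experts are copied wholesale to a GPU store just before that layer runs, and the GPU copies are freed afterwards.

```python
# Toy illustration of weight offloading: weights live in cpu_store at
# startup; a layer's weights are copied to gpu_store for its forward
# pass and freed afterwards. Names and sizes are hypothetical.
cpu_store = {f"layer{i}.expert{j}": [0.1] * 4  # all experts present at startup
             for i in range(2) for j in range(3)}
gpu_store = {}
transfers = []  # record of CPU -> GPU copies (the "data exchanged")

def forward_layer(i, activations):
    # Copy every expert of this layer to the GPU: no gating-based
    # selection, the whole layer's weights move across.
    for name, w in cpu_store.items():
        if name.startswith(f"layer{i}."):
            gpu_store[name] = list(w)  # CPU -> GPU transfer of weights
            transfers.append(name)
    out = [a + sum(gpu_store[f"layer{i}.expert{j}"])
           for j, a in enumerate(activations)]
    # Release GPU copies after use; the CPU copy stays authoritative.
    for name in list(gpu_store):
        if name.startswith(f"layer{i}."):
            del gpu_store[name]
    return out

acts = forward_layer(0, [1.0, 2.0, 3.0])
acts = forward_layer(1, acts)
print(len(transfers), len(gpu_store))  # 6 weight transfers, GPU store empty
```

Note what moves: the transfer log holds weight tensors' names only, while activations stay on the "GPU" side throughout, matching the answer above.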

Would you like a step-by-step breakdown or code example?
