How do I enable expert offloading when using an MoE LLM for inference?

vLLM requires all experts to be loaded at initialization, just like any other model weights. However, if the experts do not all fit on the GPU(s), you can use the cpu_offload_gb parameter to keep a specified number of GB of weights in CPU memory; those weights are copied onto the GPU for each inference. Note that this ap…
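As a rough illustration of choosing a value for cpu_offload_gb, here is a minimal sketch. The helper names, the model id, the memory sizes, and the 4 GB reserve for KV cache and activations are all assumptions made up for this example; only model, tensor_parallel_size, gpu_memory_utilization, and cpu_offload_gb are actual vLLM engine arguments. It also assumes cpu_offload_gb is interpreted per GPU and that weights are split evenly across tensor-parallel GPUs.

```python
import math


def pick_cpu_offload_gb(weights_gb, num_gpus, gpu_mem_gb,
                        gpu_memory_utilization=0.9, reserve_gb=4.0):
    """Estimate a per-GPU cpu_offload_gb value (hypothetical helper).

    reserve_gb is an assumed allowance for KV cache and activations,
    not a number taken from vLLM.
    """
    per_gpu_weights = weights_gb / num_gpus
    budget = gpu_mem_gb * gpu_memory_utilization - reserve_gb
    # Offload only what does not fit; never return a negative value.
    return max(0, math.ceil(per_gpu_weights - budget))


def build_engine_kwargs(model, weights_gb, num_gpus, gpu_mem_gb):
    """Assemble keyword arguments that could be passed to vllm.LLM."""
    return {
        "model": model,
        "tensor_parallel_size": num_gpus,
        "gpu_memory_utilization": 0.9,
        "cpu_offload_gb": pick_cpu_offload_gb(weights_gb, num_gpus, gpu_mem_gb),
    }


if __name__ == "__main__":
    # Placeholder model id and sizes; substitute your own.
    kwargs = build_engine_kwargs("your-org/your-moe-model", 90, 2, 48)
    print(kwargs)
    # With vLLM installed and GPUs available, the engine would be created as:
    # from vllm import LLM
    # llm = LLM(**kwargs)
```

The same setting is available on the command line as --cpu-offload-gb when launching the server with vllm serve.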

When cpu_offload_gb is enabled, the model weights, including all MoE experts, are still loaded in full, but the specified amount is kept in CPU memory instead of on the GPU. During inference, the required weights (including expert weights) are transferred from CPU to GPU on the fly for computation, then released from …
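To get a feel for what the on-the-fly transfer costs, here is a back-of-the-envelope sketch. It assumes every offloaded byte is copied from CPU to GPU once per forward pass with no overlap between copy and compute, and the bandwidth figure is a made-up example; measure your own host-to-device bandwidth, and treat the result as a rough estimate rather than a statement about vLLM internals.

```python
def offload_transfer_seconds(cpu_offload_gb, host_to_device_gb_per_s):
    """Estimated extra time per forward pass spent copying offloaded weights.

    Assumes all offloaded weights cross the CPU-GPU link once per forward
    pass and that the copy is not hidden behind computation.
    """
    if host_to_device_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return cpu_offload_gb / host_to_device_gb_per_s


if __name__ == "__main__":
    # Example: 6 GB offloaded, 25 GB/s assumed effective link bandwidth.
    extra = offload_transfer_seconds(6, 25)
    print(f"~{extra:.2f} s of weight transfer per forward pass")
```

Under these assumptions the transfer time grows linearly with cpu_offload_gb, so it is worth offloading only as much as is needed to make the model fit.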

Enable Expert Offloading

Lambmietea November 11, 2025, 9:34am 3

@RunLLM As you said, “vLLM requires that all experts are loaded at initialization, just like any other weight”. In that case, what data is exchanged when the cpu_offload_gb parameter is enabled?
