How to enable expert offloading when using a MoE LLM for inference?
vLLM requires all experts to be loaded at initialization, just like any other model weights. If the full set of experts does not fit on your GPU(s), you can use the cpu_offload_gb parameter to offload a specified number of GiB of weights to CPU memory; the offloaded weights are copied back to the GPU on each forward pass. Note that this approach is slow, and there is no support for loading only the experts the router activates: all experts must be present in memory (GPU or CPU) at startup. On-demand or JIT expert loading is not currently supported in vLLM, according to the available documentation and issues.
For more details or code examples, let me know.
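A minimal sketch of what enabling this looks like with vLLM's offline API. The model name is only an illustrative MoE checkpoint, and the 10 GiB value is an arbitrary example; choose a value large enough to cover the weights that spill past GPU memory. The import is guarded so the snippet documents the intended call even where vLLM is not installed:

```python
# Sketch: enabling weight offloading via vLLM's cpu_offload_gb engine argument.
# The model name below is an example MoE checkpoint, not a recommendation.
engine_args = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "cpu_offload_gb": 10,  # GiB of weights kept in CPU RAM, per GPU
}

try:
    from vllm import LLM
    # Requires a GPU plus enough CPU RAM to hold the offloaded 10 GiB.
    llm = LLM(**engine_args)
except ImportError:
    llm = None  # vLLM not installed; engine_args above shows the intended call
```

Expect noticeably lower throughput than a fully GPU-resident model, since the offloaded weights cross the PCIe bus on every forward pass.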
Hint: Mention @RunLLM in the post for followups.
@RunLLM As you said “vLLM requires that all experts are loaded at initialization, just like any other weight”, what data is actually exchanged when the cpu_offload_gb parameter is enabled?
When cpu_offload_gb is enabled, all model weights, including every MoE expert, are first loaded into CPU memory, and the specified amount stays offloaded there rather than resident on the GPU. During each forward pass, the offloaded weights (including expert weights) are copied from CPU to GPU for computation, then released from GPU memory afterwards. There is no on-demand or partial expert loading: all expert weights must be present in either GPU or CPU memory at startup. The data exchanged is the model weights themselves, not activations or intermediate results.
Would you like a step-by-step breakdown or code example?
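As a step-by-step illustration, here is a toy simulation (plain Python, not vLLM internals) of the exchange described above: offloaded weight tensors live in a CPU-side store, every one of them is copied to the GPU-side store for each forward pass regardless of which experts the router would activate, and they are released afterwards. All names here are hypothetical:

```python
# Toy model of cpu_offload_gb behavior: weights move, not activations.
cpu_store = {f"expert_{i}.weight": [0.1] * 4 for i in range(8)}  # 8 experts on CPU
gpu_store = {}   # transient GPU residency during a forward pass
transfers = []   # record of CPU -> GPU copies

def forward(tokens):
    # Every offloaded weight is staged onto the GPU for this pass,
    # even though the router would only activate a subset of experts.
    for name, tensor in cpu_store.items():
        gpu_store[name] = list(tensor)  # simulated CPU -> GPU copy
        transfers.append(name)
    out = sum(len(t) for t in gpu_store.values())  # stand-in for computation
    gpu_store.clear()  # weights released from GPU memory after the pass
    return out

result = forward(["hello"])
```

The point of the sketch: the per-pass traffic is proportional to the offloaded weight size, not to the number of experts actually routed to, which is why this mode is slow.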
Sources:
- vLLM requires that all experts are loaded at initialization, just like any other weight. If you don’t have enough space to fit all your weights, you can use the cpu_offload_gb parameter to choose how many GBs to offload to CPU memory to be swapped onto GPU for each inference. Note that this is slow
- cpu_offload_gb: The space in GiB to offload to CPU, per GPU. … as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass.