Does vLLM currently support expert offloading for MoE models?
vLLM does not currently support on-demand or dynamic expert offloading for MoE models: all experts must be loaded into GPU memory at initialization. If the model does not fit in GPU memory, you can use the cpu_offload_gb parameter to offload a portion of the weights to CPU, but this offloading is static, adds significant latency, and is not selective to inactive experts. There is no mechanism for loading only the experts selected by the gating network during inference, or for swapping experts in and out on demand; this limitation is explicitly noted in user discussions and documentation for MoE models such as Mixtral and Qwen3-MoE in vLLM[^1^][^2^].
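As a minimal sketch, the cpu_offload_gb workaround looks like the following; the model name and offload budget here are illustrative, and this offloads a fixed slice of weights rather than anything expert-aware:

```python
from vllm import LLM  # assumes vLLM is installed with a CUDA-capable GPU

# cpu_offload_gb reserves a fixed budget of weight memory on the CPU;
# it is not expert-aware, so MoE experts are offloaded indiscriminately
# and moved over PCIe on every forward pass, which is slow.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example MoE model
    cpu_offload_gb=16,  # offload up to ~16 GiB of weights to CPU RAM
)
```

The equivalent server flag is `--cpu-offload-gb 16` on `vllm serve`. This is a memory-capacity workaround, not a substitute for true dynamic expert offloading.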
Would you like more detail on workarounds or deployment strategies for large MoE models?