Does vLLM currently support expert offloading for MoE models?
vLLM does not currently support on-demand or dynamic expert offloading for MoE models: all experts must be loaded into GPU memory at initialization. If the model does not fit in GPU memory, you can use the cpu_offload_gb parameter to offload a portion of the weights to CPU, but this offloading is static, adds significant latency, and is not selective to inactive experts. There is no mechanism for loading only the experts selected by the gating network during inference, or for swapping experts in and out on demand; this limitation is explicitly noted in user discussions and documentation for MoE models such as Mixtral and Qwen3-MoE in vLLM[^1^][^2^].
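As a minimal sketch, the cpu_offload_gb workaround looks like the following; the model name and offload budget here are illustrative, and this offloads a fixed slice of weights rather than anything expert-aware:

```python
from vllm import LLM  # assumes vLLM is installed with a CUDA-capable GPU

# cpu_offload_gb reserves a fixed budget of weight memory on the CPU;
# it is not expert-aware, so MoE experts are offloaded indiscriminately
# and moved over PCIe on every forward pass, which is slow.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example MoE model
    cpu_offload_gb=16,  # offload up to ~16 GiB of weights to CPU RAM
)
```

The equivalent server flag is `--cpu-offload-gb 16` on `vllm serve`. This is a memory-capacity workaround, not a substitute for true dynamic expert offloading.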
Would you like more detail on workarounds or deployment strategies for large MoE models?