Is there any way to change the settings of fused_experts? It appears the engine automatically detected the configuration and set the MoE backend to FlashInferExperts.
The error occurs because vLLM's LoRA support for fused MoE currently expects the expert kernel to be TritonExperts, but your configuration is using FlashInferExperts, which is not compatible with the current LoRA injection logic. The backend is chosen automatically based on hardware and model configuration, and there is no documented, supported way to force TritonExperts once vLLM has auto-selected FlashInferExperts; the override is not exposed as a user setting.
To resolve this, you would need a configuration (hardware, model, or quantization) under which vLLM selects TritonExperts, or wait for upstream LoRA support for FlashInferExperts. There is no supported runtime flag or documented environment variable that forces the fused MoE backend for LoRA in current vLLM releases, and bypassing the assertion is likely to surface further incompatibilities. Would you like more detail on how vLLM selects MoE backends, or on possible workarounds for your use case?
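To make the failure mode concrete, here is an illustrative sketch (not vLLM's actual source) of the kind of type check the LoRA injection path performs. The class and function names are placeholders; only the shape of the check, an assertion that the expert kernel is the Triton implementation, mirrors the behavior described above.

```python
# Illustrative sketch only -- class and function names are hypothetical,
# not vLLM's real internals.

class TritonExperts:
    """Stand-in for the Triton fused-MoE expert kernel."""

class FlashInferExperts:
    """Stand-in for the FlashInfer fused-MoE expert kernel."""

def apply_lora_to_moe(expert_kernel):
    # The LoRA path currently supports only the Triton expert kernel,
    # so any other backend trips this assertion at startup.
    assert isinstance(expert_kernel, TritonExperts), (
        f"Fused MoE LoRA requires TritonExperts, "
        f"got {type(expert_kernel).__name__}"
    )
    return True
```

This is why the error appears before any request is served: the check runs when the LoRA layers are wired into the model, not at inference time.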
The MoE backend in vLLM is decided automatically based on the model architecture, quantization type, and hardware. By default, the setting is moe_backend="auto", which lets vLLM select the most suitable backend (e.g., Triton, DeepGEMM, Cutlass, FlashInfer, Marlin, AITer) for your environment and model. You can override this by explicitly setting the --moe-backend CLI flag or the corresponding config field to a specific backend name if needed.
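If your vLLM version exposes the flag described above, the override would look roughly like this. The model name is a placeholder, and both the flag and the accepted backend names depend on your vLLM version, so verify them with `vllm serve --help` first.

```shell
# Hypothetical invocation: force the Triton fused-MoE backend instead of
# the auto-selected one, while keeping LoRA enabled.
# Flag availability and backend names vary by vLLM version.
vllm serve <your-moe-model> \
  --moe-backend triton \
  --enable-lora
```

If the flag is rejected, your build predates user-facing backend selection and the auto-selected backend cannot be overridden, as described in the earlier answer.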
Backend selection logic is not fully user-configurable beyond these options; vLLM internally checks compatibility and availability before finalizing the backend. For details and available options, see KernelConfig documentation and CLI reference. Would you like more detail on the backend selection process or the meaning of each backend?