Is there any way to change the settings of fused_experts? It appears the engine automatically detected the configuration and set the MoE backend to FlashInferExperts.
The error occurs because vLLM's LoRA support for fused MoE currently expects the expert kernel to be TritonExperts, but your configuration is using FlashInferExperts, which is not compatible with the current LoRA injection logic. The backend is chosen automatically based on hardware and model configuration, and there is no documented, supported way to force TritonExperts once vLLM has auto-selected FlashInferExperts; the override is not exposed as a user setting.
To resolve this, you would need a configuration (hardware, model, or quantization) under which vLLM selects TritonExperts, or wait for upstream LoRA support for FlashInferExperts. There is no supported runtime flag or documented environment variable that forces the fused MoE backend for LoRA in current vLLM releases, and bypassing the assertion is likely to surface further incompatibilities. Would you like more detail on how vLLM selects MoE backends, or on possible workarounds for your use case?
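To make the failure mode concrete, here is an illustrative sketch (not vLLM's actual source) of the kind of type check the LoRA injection path performs. The class and function names are placeholders; only the shape of the check, an assertion that the expert kernel is the Triton implementation, mirrors the behavior described above.

```python
# Illustrative sketch only -- class and function names are hypothetical,
# not vLLM's real internals.

class TritonExperts:
    """Stand-in for the Triton fused-MoE expert kernel."""

class FlashInferExperts:
    """Stand-in for the FlashInfer fused-MoE expert kernel."""

def apply_lora_to_moe(expert_kernel):
    # The LoRA path currently supports only the Triton expert kernel,
    # so any other backend trips this assertion at startup.
    assert isinstance(expert_kernel, TritonExperts), (
        f"Fused MoE LoRA requires TritonExperts, "
        f"got {type(expert_kernel).__name__}"
    )
    return True
```

This is why the error appears before any request is served: the check runs when the LoRA layers are wired into the model, not at inference time.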
The MoE backend in vLLM is decided automatically based on the model architecture, quantization type, and hardware. By default, the setting is moe_backend="auto", which lets vLLM select the most suitable backend (e.g., Triton, DeepGEMM, Cutlass, FlashInfer, Marlin, AITer) for your environment and model. You can override this by explicitly setting the --moe-backend CLI flag or the corresponding config field to a specific backend name if needed.
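If your vLLM version exposes the flag described above, the override would look roughly like this. The model name is a placeholder, and both the flag and the accepted backend names depend on your vLLM version, so verify them with `vllm serve --help` first.

```shell
# Hypothetical invocation: force the Triton fused-MoE backend instead of
# the auto-selected one, while keeping LoRA enabled.
# Flag availability and backend names vary by vLLM version.
vllm serve <your-moe-model> \
  --moe-backend triton \
  --enable-lora
```

If the flag is rejected, your build predates user-facing backend selection and the auto-selected backend cannot be overridden, as described in the earlier answer.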
Backend selection logic is not fully user-configurable beyond these options; vLLM internally checks compatibility and availability before finalizing the backend. For details and available options, see KernelConfig documentation and CLI reference. Would you like more detail on the backend selection process or the meaning of each backend?