MoE quantization

The error you’re seeing (`assert self.quant_method is not None` in `FusedMoE`) is a known issue when serving Unsloth’s dynamic bitsandbytes (BNB) 4-bit quantized Llama-4 Scout models with vLLM. At the time of writing, vLLM does not support BNB 4-bit quantization for MoE (Mixture of Experts) models such as Llama-4 Scout, because its FusedMoE layer has no bitsandbytes kernel. The layer is therefore left without a quantization method (`quant_method` is `None`), which is exactly what the assertion checks. Recent discussions and issues report the same error, and vLLM maintainers have confirmed that this support is not yet available, even though some users expected it to work or recall it working. You were not hallucinating: work on this is ongoing, but for now these models cannot be served with vLLM, and installing extra packages or dependencies will not fix it[^1][^2][^3].
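As a rough pre-flight check, the sketch below takes a checkpoint’s `config.json` (already parsed into a dict) and flags the combination described above: a bitsandbytes 4-bit `quantization_config` on a model with more than one expert. The key names it reads (`quant_method`, `load_in_4bit`, `text_config`, `num_local_experts`, `num_experts`, `n_routed_experts`) are assumptions based on common Hugging Face config layouts, and the function names are hypothetical. It is a heuristic, not vLLM’s own loading logic.

```python
from typing import Any, Mapping

# Expert-count keys used by common Hugging Face MoE configs (assumption; varies by architecture).
_EXPERT_KEYS = ("num_local_experts", "num_experts", "n_routed_experts")


def _text_config(config: Mapping[str, Any]) -> Mapping[str, Any]:
    # Multimodal configs such as Llama-4 nest language-model settings under
    # "text_config" (assumption); flat configs fall back to the top level.
    nested = config.get("text_config")
    return nested if isinstance(nested, Mapping) else config


def is_moe(config: Mapping[str, Any]) -> bool:
    """True if the config declares more than one expert."""
    text_cfg = _text_config(config)
    for key in _EXPERT_KEYS:
        value = text_cfg.get(key)
        if isinstance(value, int) and value > 1:
            return True
    return False


def is_bnb_4bit(config: Mapping[str, Any]) -> bool:
    """True if the checkpoint is serialized with bitsandbytes 4-bit quantization."""
    qcfg = config.get("quantization_config") or {}
    return qcfg.get("quant_method") == "bitsandbytes" and bool(qcfg.get("load_in_4bit"))


def likely_hits_fusedmoe_assert(config: Mapping[str, Any]) -> bool:
    """Hypothetical check: BNB 4-bit + MoE is the unsupported combination."""
    return is_moe(config) and is_bnb_4bit(config)
```

To use it, load the checkpoint’s `config.json` with `json.load` and pass the resulting dict to `likely_hits_fusedmoe_assert`.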

The only current solution is to switch to a quantization format that vLLM supports for MoE models, such as RedHatAI’s Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 (4-bit weights, 16-bit activations) or an FP8-quantized checkpoint, and to run the latest nightly build of vLLM. If you specifically need Unsloth’s dynamic BNB 4-bit quantization, you will have to wait for upstream support in vLLM. Would you like more detail on using the supported quantized models, or on tracking support for Unsloth dynamic BNB 4-bit in vLLM?
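A minimal sketch of serving the RedHatAI checkpoint named above with the `vllm serve` CLI. It only builds and prints the command; `build_serve_command` is a hypothetical helper, and the tensor-parallel size and context length are placeholders to adjust for your GPUs. vLLM normally detects the quantization scheme from the checkpoint’s own config, so no `--quantization` flag is passed.

```python
import shlex

# Checkpoint named in the answer; serving values below are placeholders, not recommendations.
MODEL_ID = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"


def build_serve_command(
    model_id: str = MODEL_ID,
    tensor_parallel_size: int = 1,
    max_model_len: int = 8192,
) -> list[str]:
    """Return the argv list for `vllm serve` (hypothetical helper)."""
    return [
        "vllm", "serve", model_id,
        # Number of GPUs to shard the model across.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Cap the context length to bound KV-cache memory.
        "--max-model-len", str(max_model_len),
    ]


if __name__ == "__main__":
    # Print a shell-safe command line to copy into a terminal.
    print(shlex.join(build_serve_command()))
```

Run the printed command in a shell with a recent (nightly) vLLM installed; it starts an OpenAI-compatible server for the model.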

Sources: