MoE quantization

RunLLM · May 8, 2025, 12:20pm

The error you’re seeing (`assert self.quant_method is not None` in `FusedMoE`) is a known issue when serving Unsloth dynamic BNB 4-bit quantized Llama-4 Scout models with vLLM. vLLM does not currently support bitsandbytes 4-bit (BNB INT4) quantization for Mixture of Experts (MoE) models such as Llama-4 Scout, because there is no FusedMoE kernel compatible with bitsandbytes quantization. Recent discussions and issues confirm this: vLLM maintainers and users report the same error and clarify that support is not yet available, despite some users’ expectations or past experience. You were not hallucinating; work is ongoing, but for now these models cannot be served with vLLM, and installing extra packages or dependencies will not work around it[^1][^2][^3].
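To catch this combination before vLLM hits the assertion, one option is to inspect the checkpoint's `config.json` first. The sketch below is only an illustration: it assumes the checkpoint follows the usual Hugging Face layout, where bitsandbytes checkpoints carry `quantization_config.quant_method == "bitsandbytes"` and MoE models expose an expert count either at the top level or under `text_config`. The helper names and the list of expert-count keys are hypothetical and not part of vLLM.

```python
import json
from pathlib import Path

# Keys that commonly hold the expert count in Hugging Face MoE configs.
# Assumption: this list is illustrative, not exhaustive.
MOE_KEYS = ("num_local_experts", "num_experts", "n_routed_experts")


def _has_experts(cfg: dict) -> bool:
    """Return True if any known expert-count key is an integer above 1."""
    return any(isinstance(cfg.get(k), int) and cfg[k] > 1 for k in MOE_KEYS)


def is_bnb_moe(config: dict) -> bool:
    """Return True for a bitsandbytes-quantized MoE config (hypothetical helper)."""
    quant = config.get("quantization_config") or {}
    is_bnb = quant.get("quant_method") == "bitsandbytes"
    # Multimodal models such as Llama 4 may nest the language-model
    # settings under "text_config" (assumption about the layout).
    text_cfg = config.get("text_config") or {}
    return is_bnb and (_has_experts(config) or _has_experts(text_cfg))


def check_checkpoint(model_dir: str) -> bool:
    """Load config.json from a local checkpoint directory and run the check."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return is_bnb_moe(config)
```

If `check_checkpoint` returns `True` for a local model directory, the checkpoint is the BNB 4-bit plus MoE combination described above, and switching formats is likely the faster path than debugging the install.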

The only current solution is to switch to a quantization format that vLLM supports, such as the official RedHatAI Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 (INT4) model or an FP8 quantized model, and to make sure you are running the latest nightly build of vLLM. If you specifically need Unsloth’s dynamic BNB-4bit quantization, you will have to wait for upstream support in vLLM. Would you like more detail on using the supported quantized models or on tracking support for Unsloth dynamic BNB-4bit in vLLM?
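As a rough sketch of the switch, the snippet below assembles a `vllm serve` command for the w4a16 model named above. The repository ID (with the `RedHatAI/` organization prefix) is an assumption based on that name, and the tensor-parallel size and context length are placeholder values to adjust for your hardware; `build_serve_command` is a hypothetical helper, not a vLLM API.

```python
import shlex

# Assumption: Hugging Face repo ID inferred from the model name in this post.
SUPPORTED_MODEL = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"


def build_serve_command(model=SUPPORTED_MODEL, tensor_parallel_size=1, max_model_len=None):
    """Build the argument list for `vllm serve` (hypothetical helper)."""
    cmd = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        # Capping the context length reduces KV-cache memory use.
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


if __name__ == "__main__":
    # Placeholder values: 4 GPUs and an 8K context window.
    print(shlex.join(build_serve_command(tensor_parallel_size=4, max_model_len=8192)))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run` directly once the nightly build of vLLM is installed.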

Sources:
