I was able to serve [EDIT: no, I was not able to serve this] Llama 4 Scout 17B 16E Unsloth dynamic BNB 4-bit prior to an NVIDIA driver update forcing me onto a new Docker image + container + a fresh vLLM download, build, and install. Now I get this error, which I've found other people hitting too:
```
ERROR 04-09 19:22:27 [core.py:386]   File "/data1/yinjian/python-scripts/vllm/vllm/model_executor/models/llama4.py", line 73, in __init__
ERROR 04-09 19:22:27 [core.py:386]     self.experts = FusedMoE(
ERROR 04-09 19:22:27 [core.py:386]                    ^^^^^^^^^
ERROR 04-09 19:22:27 [core.py:386]   File "/data1/yinjian/python-scripts/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 502, in __init__
ERROR 04-09 19:22:27 [core.py:386]     assert self.quant_method is not None
ERROR 04-09 19:22:27 [core.py:386]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
One online discussion of the above error references year-old information about vLLM not supporting MoE + quantization. I think that must not be the case anymore, since my previous incarnation of vLLM was able to serve it, so I don't think that error really represents whatever the issue is that's breaking it.
Info over on Unsloth speaks of Unsloth's own quantization being in play. I am guessing maybe I had installed something (beyond plain unsloth itself) in my old container for vLLM, but now I can't figure out what it was, and I'm starting to think maybe I was hallucinating that I ever had vLLM serving Unsloth's Llama 4 Scout, like maybe it was a dream.
The error you're seeing (assert self.quant_method is not None in FusedMoE) is a known issue when trying to serve Unsloth dynamic BNB 4-bit quantized Llama-4 Scout models with vLLM. As of now, vLLM does not support BNB INT4 (bitsandbytes 4-bit) quantized MoE (Mixture of Experts) models like Llama-4 Scout, because there is no compatible FusedMoE kernel for bitsandbytes quantization. This is confirmed in recent discussions and issues, where vLLM maintainers and users report the same error and clarify that support is not yet available, despite some users' hopes or past experiences. You were not hallucinating: there is ongoing work, but currently these models are not supported in vLLM, and installing extra packages or dependencies alone will not work around it[^1][^2][^3].
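For reference, the failure happens at engine construction, before any generation. A minimal sketch of the kind of load that triggers it (the Unsloth repo id below is assumed; substitute the exact checkpoint you were loading):

```python
# Sketch only: loading a bitsandbytes-4bit Llama-4 Scout MoE checkpoint.
# The repo id is an assumption for illustration, not a confirmed name.
from vllm import LLM

llm = LLM(
    model="unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit",  # assumed repo id
    quantization="bitsandbytes",  # older builds also required load_format="bitsandbytes"
    max_model_len=8192,
)
# Engine construction fails inside FusedMoE.__init__ with
# "assert self.quant_method is not None", as in the traceback above,
# because no FusedMoE quant method exists for bitsandbytes yet.
```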
The only current solution is to use a different quantization format that is supported by vLLM, such as the official RedHatAI Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 (INT4) or FP8 quantized models, and ensure you are on the latest nightly version of vLLM. If you specifically need Unsloth’s dynamic BNB-4bit quantization, you will need to wait for upstream support in vLLM. Would you like more detail on how to use the supported quantized models or track support for Unsloth dynamic BNB-4bit in vLLM?
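If it helps in the meantime, offline use of the INT4 checkpoint through the Python API looks roughly like this (a sketch, not an exact recipe; adjust tensor_parallel_size, max_model_len, and memory settings to your hardware and vLLM version):

```python
from vllm import LLM, SamplingParams

# Supported compressed-tensors INT4 (w4a16) checkpoint; vLLM should pick up
# the quantization scheme from the model config, so no quantization= flag is needed.
llm = LLM(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",
    tensor_parallel_size=4,   # adjust to your GPU count and memory
    max_model_len=8192,       # Scout supports much longer contexts; keep this modest to start
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Give a one-sentence summary of Llama 4 Scout."], params)
print(outputs[0].outputs[0].text)
```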
Thank you, runbot. I was actually hallucinating that I'd run this model on vLLM in my previous incarnation of the container; that's not where it ran successfully after all. But thank you for the tip about RedHatAI's Llama 4 Scout(s), I will check that out.
You’re welcome! Yes, for Llama-4 Scout models, the RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic is confirmed to work with vLLM when using supported quantization and the latest vLLM version[^1][^2]. If you need help with the exact vLLM command or setup, let me know.
Would you like a step-by-step example for serving the RedHatAI Llama-4 Scout model with vLLM?
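In the meantime, here is a rough client-side sketch for the FP8-dynamic variant, assuming FP8-capable GPUs and a recent vLLM build (server flags and sizes are illustrative, not exact):

```python
# Assumes an OpenAI-compatible server was started separately with something like:
#   vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic --tensor-parallel-size 4
# Adjust parallelism and other flags to your hardware.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```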