RuntimeError: Int8 not supported on SM120. Use FP8 quantization instead, or run on older arch (SM < 100)

When I used the GPTQ method from the llmcompressor library to perform INT8 quantization on Qwen3-VL-4B and then ran inference with vLLM 0.11.0 on an RTX 5090, the following error occurred: RuntimeError: Int8 not supported for this architecture.
However, the same setup works fine on an RTX 4090.
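
For reference, a minimal sketch of the workflow described above, assuming llmcompressor's `oneshot` API with `GPTQModifier(scheme="W8A8")`; the model id, calibration dataset, sample counts, and output path are placeholders, and a vision-language model like Qwen3-VL may need additional modules (e.g. the vision tower) excluded from quantization:

```python
# Sketch: GPTQ W8A8 (INT8) quantization with llmcompressor, then serving with vLLM.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"   # placeholder model id
SAVE_DIR = "Qwen3-VL-4B-W8A8"            # placeholder output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ requires calibration data; the dataset and sample count here are examples.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    output_dir=SAVE_DIR,
)
tokenizer.save_pretrained(SAVE_DIR)

# Inference with vLLM 0.11.0 — this is the step that raises
# "RuntimeError: Int8 not supported ..." on an RTX 5090 (SM 120).
from vllm import LLM
llm = LLM(model=SAVE_DIR)
print(llm.generate("Hello")[0].outputs[0].text)
```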

vLLM's current INT8 quantization (W8A8) is not supported on the RTX 5090 (Blackwell architecture, SM 120), which is why you get "RuntimeError: Int8 not supported for this architecture". It is supported on the RTX 4090 (Ada architecture, SM 89), so it runs normally there. On Blackwell you need to use FP8 quantization; INT8 support has not been implemented yet. See the related issues linked from the official documentation.

To run a quantized model on an RTX 5090, use an FP8-quantized model, or wait for a future vLLM release to add INT8 support for the Blackwell architecture. The official guidance is explicit that INT8 is not supported on Blackwell, so switching quantization schemes is the recommended path.
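
A minimal sketch of the FP8 alternative, assuming llmcompressor's `QuantizationModifier` with the `FP8_DYNAMIC` scheme (which needs no calibration data); the model id and output path are placeholders:

```python
# Sketch: FP8 dynamic quantization with llmcompressor, then loading in vLLM on Blackwell.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"   # placeholder model id
SAVE_DIR = "Qwen3-VL-4B-FP8-Dynamic"     # placeholder output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic activation quantization: weights quantized offline, activations at runtime.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

# The resulting FP8 checkpoint should load on an RTX 5090 (SM 120).
from vllm import LLM
llm = LLM(model=SAVE_DIR)
```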

Would you like a more detailed technical analysis or alternative options?
