Asking for 6-bit Quantization

Hello everybody in vLLM,

I'd like to ask you to implement support for 6-bit quantization.

Sometimes our GPU cannot handle even the FP8 or AWQ 8-bit version of a model, yet we don't want to fall back to AWQ 4-bit because it reduces accuracy. Supporting 6-bit weights would give us a better accuracy/memory trade-off.
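To make the trade-off concrete, here is a rough back-of-the-envelope calculation of weight memory at different bit-widths, assuming a hypothetical 7B-parameter model and ignoring activations, KV cache, and per-group scale/zero-point overhead:

```python
# Rough weight-memory footprint at different bit-widths.
# Illustrative arithmetic only; a 7B model is assumed as an example.
def weight_gib(num_params: float, bits: int) -> float:
    """Bytes of weight storage converted to GiB."""
    return num_params * bits / 8 / 1024**3

params = 7e9  # hypothetical 7B-parameter model
for bits in (16, 8, 6, 4):
    print(f"{bits:2d}-bit: {weight_gib(params, bits):.1f} GiB")
# 16-bit: 13.0 GiB, 8-bit: 6.5 GiB, 6-bit: 4.9 GiB, 4-bit: 3.3 GiB
```

A 6-bit format sits between 8-bit (too large for some GPUs) and 4-bit (too lossy), which is the gap this request is about.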


vLLM does not currently appear to support 6-bit quantization. The official documentation and issue tracker only mention INT4 (4-bit), INT8 (8-bit), FP8, and related methods such as AWQ, GPTQ, and SmoothQuant; 6-bit is not listed among the supported options, and there is no explicit mention of plans for it in the available sources. Quantized checkpoints at the supported bit-widths are widely available via toolchains and formats such as GPTQ, AWQ, and FP8/INT8 collections on Hugging Face, or can be produced with the llm-compressor library. See Quantization - vLLM and Supported Hardware.

Though I’m not entirely certain, implementing 6-bit quantization would likely require new CUDA kernels and changes to the quantization toolchain, since the current methods and libraries (GPTQ, AWQ, SmoothQuant, FP8, INT8) do not provide 6-bit support out of the box. If you need better accuracy than 4-bit but a smaller footprint than 8-bit, you may need to request this feature from the vLLM team or run a custom quantization pipeline outside of vLLM for now.
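One reason 6-bit needs dedicated kernels is that 6-bit values don't align to byte boundaries: four 6-bit weights must be packed into three bytes. A minimal CPU-side sketch of that packing (hypothetical helper names, not vLLM code; real kernels would do the equivalent on-GPU):

```python
import numpy as np

def pack_6bit(vals: np.ndarray) -> np.ndarray:
    """Pack uint8 values in [0, 63], in groups of 4, into 3 bytes each."""
    assert vals.size % 4 == 0
    v = vals.reshape(-1, 4).astype(np.uint32)
    # Assemble a 24-bit word from four 6-bit fields.
    word = v[:, 0] | (v[:, 1] << 6) | (v[:, 2] << 12) | (v[:, 3] << 18)
    out = np.empty((v.shape[0], 3), dtype=np.uint8)
    out[:, 0] = word & 0xFF          # little-endian byte split
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = (word >> 16) & 0xFF
    return out.reshape(-1)

def unpack_6bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_6bit: recover four 6-bit values per 3 bytes."""
    b = packed.reshape(-1, 3).astype(np.uint32)
    word = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    out = np.empty((b.shape[0], 4), dtype=np.uint8)
    for i in range(4):
        out[:, i] = (word >> (6 * i)) & 0x3F
    return out.reshape(-1)

vals = np.array([0, 63, 17, 42, 1, 2, 3, 4], dtype=np.uint8)
assert np.array_equal(unpack_6bit(pack_6bit(vals)), vals)
```

The awkward 4-values-per-3-bytes layout, and the shifting needed to unpack it inside a matmul kernel, is a large part of why 6-bit support is not a trivial addition to existing 4/8-bit code paths.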

Would you like more detail on how to request this feature or on the current quantization workflow in vLLM?

