GLM 5 / Kimi k2.5 on 4 x RTX 6000 Pro

Hi. Is there any way to get these models running via vLLM on such hardware? They "almost" have enough VRAM for AWQ or similar quants, but not quite. It would be a killer feature to be able to run them.

One way would be supporting ubergarm quants, which remain highly functional even past 100k context. I've included a few that I've been testing for a while via ik_llama:

- IQ3_KS: 320.216 GiB (3.649 BPW)
- IQ2_KL: 261.988 GiB (2.985 BPW)
- smol-IQ2_KL: 329.195 GiB (2.755 BPW)
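To make the "almost enough" point concrete, here is a rough headroom check comparing the quoted quant sizes against aggregate VRAM. This is a back-of-the-envelope sketch: it assumes 96 GB per RTX 6000 Pro (an assumption, check your exact card) and ignores KV cache, activations, and runtime overhead, which all eat into whatever is left.

```python
# Rough headroom check: quoted quant sizes vs. aggregate VRAM.
# Assumption: 96 GB VRAM per RTX 6000 Pro, 4 cards total.
GIB = 2 ** 30
CARDS, VRAM_GB = 4, 96
total_gib = CARDS * VRAM_GB * 1e9 / GIB  # marketing GB -> GiB

# Sizes as quoted in the thread above.
quants_gib = {
    "IQ3_KS": 320.216,
    "IQ2_KL": 261.988,
    "smol-IQ2_KL": 329.195,
}

# What remains for KV cache, activations, and framework overhead.
headroom = {name: total_gib - size for name, size in quants_gib.items()}
for name, free in headroom.items():
    print(f"{name}: {free:+.1f} GiB left for KV cache / activations")
```

The weights alone fit in all three cases, but at 100k+ context the KV cache claims a large share of the remainder, which is why the low-BPW quants matter here.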

Currently, vLLM's GGUF support is limited: it can load single-file GGUF checkpoints with common quantization types such as Q3_K_M and Q3_K_S, but not custom, highly compressed formats like ubergarm's IQ2_KL or IQ3_KS. There are ongoing discussions and feature requests to add optimized support for more low-bit GGUF quantizations (like those used by ubergarm), but as of now these are not supported in vLLM and may not run reliably, or at all, especially at very large context windows or with custom quant types (see discussion, GGUF docs).

If your hardware is just below the VRAM threshold for AWQ or similar, you can try a supported GGUF quantization (e.g., Q3_K_M) by downloading a compatible model and serving it with vLLM. For ubergarm's custom quant types (IQ2_KL, IQ3_KS), vLLM does not yet provide official support or optimized kernels; feature requests for these formats are open and contributions are welcome, but there is no timeline for full support (see feature request). Would you like more detail on current GGUF support or alternative approaches?
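As a minimal sketch of the supported path, the launch command for serving a single-file GGUF checkpoint across all four GPUs might look like the following. The model path is illustrative, and the flags assume vLLM's standard `vllm serve` CLI with `--tensor-parallel-size` and `--max-model-len`; verify against your installed version.

```python
# Sketch: assemble a `vllm serve` command for a single-file GGUF checkpoint.
# The .gguf path is a placeholder; flags assume the standard vLLM CLI.
def build_vllm_cmd(gguf_path: str, tp: int = 4, max_len: int = 32768) -> list[str]:
    return [
        "vllm", "serve", gguf_path,
        "--tensor-parallel-size", str(tp),  # shard across the 4 cards
        "--max-model-len", str(max_len),    # cap context to bound KV cache
    ]

cmd = build_vllm_cmd("models/model-Q3_K_M.gguf")
print(" ".join(cmd))
```

Capping `--max-model-len` is the main lever when weights barely fit, since it bounds the KV cache that must share the remaining VRAM.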

