GLM 5 / Kimi k2.5 on 4 x RTX 6000 Pro

Hi. Is there any way to get these models running via vLLM on such hardware? They "almost" have enough VRAM for AWQ or similar quants, but not quite. It would be a killer feature to be able to run them.

One way would be supporting ubergarm quants, which remain highly functional even past 100k context. I've included a few that I've been testing for a while via ik_llama:

- IQ3_KS: 320.216 GiB (3.649 BPW)
- IQ2_KL: 261.988 GiB (2.985 BPW)
- smol-IQ2_KL: 329.195 GiB (2.755 BPW)
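To make the "almost enough" point concrete, here is a rough headroom check comparing the quoted quant sizes against aggregate VRAM. This is a back-of-the-envelope sketch: it assumes 96 GB per RTX 6000 Pro (an assumption, check your exact card) and ignores KV cache, activations, and runtime overhead, which all eat into whatever is left.

```python
# Rough headroom check: quoted quant sizes vs. aggregate VRAM.
# Assumption: 96 GB VRAM per RTX 6000 Pro, 4 cards total.
GIB = 2 ** 30
CARDS, VRAM_GB = 4, 96
total_gib = CARDS * VRAM_GB * 1e9 / GIB  # marketing GB -> GiB

# Sizes as quoted in the thread above.
quants_gib = {
    "IQ3_KS": 320.216,
    "IQ2_KL": 261.988,
    "smol-IQ2_KL": 329.195,
}

# What remains for KV cache, activations, and framework overhead.
headroom = {name: total_gib - size for name, size in quants_gib.items()}
for name, free in headroom.items():
    print(f"{name}: {free:+.1f} GiB left for KV cache / activations")
```

The weights alone fit in all three cases, but at 100k+ context the KV cache claims a large share of the remainder, which is why the low-BPW quants matter here.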

Currently, vLLM's GGUF support is limited: it can load single-file GGUF checkpoints with common quantization types such as Q3_K_M and Q3_K_S, but not custom, highly compressed formats like ubergarm's IQ2_KL or IQ3_KS. There are ongoing discussions and feature requests to add optimized support for more low-bit GGUF quantizations (like those used by ubergarm), but as of now these are not supported in vLLM and may not run reliably, or at all, especially at very large context windows or with custom quant types (see discussion, GGUF docs).

If your hardware is just below the VRAM threshold for AWQ or similar, you can try a supported GGUF quantization (e.g., Q3_K_M) by downloading a compatible model and serving it with vLLM. For ubergarm's custom quant types (IQ2_KL, IQ3_KS), vLLM does not yet provide official support or optimized kernels; feature requests for these formats are open and contributions are welcome, but there is no timeline for full support (see feature request). Would you like more detail on current GGUF support or alternative approaches?
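As a minimal sketch of the supported path, the launch command for serving a single-file GGUF checkpoint across all four GPUs might look like the following. The model path is illustrative, and the flags assume vLLM's standard `vllm serve` CLI with `--tensor-parallel-size` and `--max-model-len`; verify against your installed version.

```python
# Sketch: assemble a `vllm serve` command for a single-file GGUF checkpoint.
# The .gguf path is a placeholder; flags assume the standard vLLM CLI.
def build_vllm_cmd(gguf_path: str, tp: int = 4, max_len: int = 32768) -> list[str]:
    return [
        "vllm", "serve", gguf_path,
        "--tensor-parallel-size", str(tp),  # shard across the 4 cards
        "--max-model-len", str(max_len),    # cap context to bound KV cache
    ]

cmd = build_vllm_cmd("models/model-Q3_K_M.gguf")
print(" ".join(cmd))
```

Capping `--max-model-len` is the main lever when weights barely fit, since it bounds the KV cache that must share the remaining VRAM.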

