How can you load a specific quantized model?
For example, I want to load this model https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/tree/main
> qwen2.5-0.5b-instruct-q4_k_m.gguf
How can I load this model using vllm?
Thanks.
To load a specific GGUF quantized model (e.g., qwen2.5-0.5b-instruct-q4_k_m.gguf), you must first download the .gguf file locally, then point vLLM to the file path. vLLM does not support loading GGUF models directly from a Hugging Face repo; only local single-file GGUF models are supported. Use the base model’s tokenizer for best results.
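For example, you could pull just that one file with huggingface-cli (a sketch; the --local-dir target here is simply the current directory, adjust as needed):

huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf --local-dir .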
Example command:
vllm serve ./qwen2.5-0.5b-instruct-q4_k_m.gguf --tokenizer Qwen/Qwen2.5-0.5B-Instruct
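Once the server is up, you can query the OpenAI-compatible endpoint. A minimal sketch, assuming the default port 8000 and that the served model name defaults to the path you passed to vllm serve (you can set a cleaner name with --served-model-name):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./qwen2.5-0.5b-instruct-q4_k_m.gguf", "messages": [{"role": "user", "content": "Hello"}]}'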
If your GGUF model is sharded across multiple files, you must first merge them into a single file with the gguf-split tool from llama.cpp before loading it in vLLM; multi-file GGUF is not supported natively yet. See the official docs for more details.
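Roughly, the merge step looks like this (a sketch; the shard filenames below are hypothetical, point gguf-split at the first shard of your actual download):

gguf-split --merge qwen2.5-0.5b-instruct-q4_k_m-00001-of-00002.gguf qwen2.5-0.5b-instruct-q4_k_m.gguf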
Would you like a step-by-step guide for downloading and merging GGUF files?