How to load a specific quantized model

How can you load a specific quantized model?

For example, I want to load qwen2.5-0.5b-instruct-q4_k_m.gguf from https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/tree/main.

How can I load this model using vllm?

Thanks.

To load a specific GGUF quantized model (e.g., qwen2.5-0.5b-instruct-q4_k_m.gguf), you must first download the .gguf file locally, then point vLLM to the file path. vLLM does not support loading GGUF models directly from a Hugging Face repo; only local single-file GGUF models are supported. Use the base model’s tokenizer for best results.
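
One way to fetch just that file is with the huggingface-cli tool from the huggingface_hub package (a minimal sketch; downloading the file any other way works just as well):

huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf --local-dir .

This places qwen2.5-0.5b-instruct-q4_k_m.gguf in the current directory, which is the path the serve command below points at.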

Example command:

vllm serve ./qwen2.5-0.5b-instruct-q4_k_m.gguf --tokenizer Qwen/Qwen2.5-0.5B-Instruct
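
Once the server is up, it exposes an OpenAI-compatible API, by default on port 8000. A quick smoke test (assuming defaults; the served model name is the path you passed to vllm serve unless you override it with --served-model-name):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./qwen2.5-0.5b-instruct-q4_k_m.gguf", "messages": [{"role": "user", "content": "Hello!"}]}'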

If your GGUF model is sharded across multiple files, you must merge them into a single file with the gguf-split tool from llama.cpp before loading it in vLLM; multi-file GGUF is not supported natively yet. See the official docs for more details.
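
Merging looks roughly like this (a sketch only; the gguf-split binary comes from a llama.cpp build and may be named llama-gguf-split in recent builds, and the shard names below are illustrative — the Qwen file above already ships as a single file, so this step is not needed for it):

./llama.cpp/gguf-split --merge qwen2.5-0.5b-instruct-q4_k_m-00001-of-00002.gguf qwen2.5-0.5b-instruct-q4_k_m-merged.gguf

Then point vllm serve at the merged .gguf file as shown above.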

Would you like a more detailed step-by-step walkthrough of downloading and merging GGUF files?
