How can you load a specific quantized model?
For example, I want to load this model https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/tree/main
> qwen2.5-0.5b-instruct-q4_k_m.gguf
How can I load this model using vllm?
Thanks.
To load a specific GGUF quantized model (e.g., qwen2.5-0.5b-instruct-q4_k_m.gguf), you must first download the .gguf file locally, then point vLLM to the file path. vLLM does not support loading GGUF models directly from a Hugging Face repo; only local single-file GGUF models are supported. Use the base model’s tokenizer for best results.
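For example, you could pull just that one file with huggingface-cli (a sketch; the --local-dir target here is simply the current directory, adjust as needed):

huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf --local-dir .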
Example command:
vllm serve ./qwen2.5-0.5b-instruct-q4_k_m.gguf --tokenizer Qwen/Qwen2.5-0.5B-Instruct
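Once the server is up, you can query the OpenAI-compatible endpoint. A minimal sketch, assuming the default port 8000 and that the served model name defaults to the path you passed to vllm serve (you can set a cleaner name with --served-model-name):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./qwen2.5-0.5b-instruct-q4_k_m.gguf", "messages": [{"role": "user", "content": "Hello"}]}'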
If your GGUF model is sharded across multiple files, you must first merge them into a single file with the gguf-split tool from llama.cpp before loading it in vLLM; multi-file GGUF is not supported natively yet. See the official docs for more details.
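Roughly, the merge step looks like this (a sketch; the shard filenames below are hypothetical, point gguf-split at the first shard of your actual download):

gguf-split --merge qwen2.5-0.5b-instruct-q4_k_m-00001-of-00002.gguf qwen2.5-0.5b-instruct-q4_k_m.gguf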
Would you like a step-by-step guide for downloading and merging GGUF files?