Hello
I want to run some model evaluations on a GGUF-quantized Llama 3 based model, using vLLM to load the model. The latest documentation says that GGUF support in vLLM is currently highly experimental and under-optimized. What does this mean exactly? Does it mean the generation quality itself could degrade, or does it mean that inference optimizations for latency and throughput are not yet supported?
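For context, here is roughly how I plan to load the model (a minimal sketch; the GGUF file name and tokenizer repo below are placeholders for my setup):

```python
from vllm import LLM, SamplingParams

# Point `model` at the local GGUF file and `tokenizer` at the original
# (unquantized) HF repo so tokenization matches the base model.
# The file name and repo id here are placeholders for my setup.
llm = LLM(
    model="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], sampling_params)
print(outputs[0].outputs[0].text)
```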
Thanks.