How to use speculative decoding?

Hey there!

So I am interested in finding out how speculative decoding works in vLLM, specifically using draft models in conjunction with GGUF models.

I tried Qwen3 32B Q4k_m and added a speculative decoding config, in which I picked Qwen3 1.7B Q6k_m. I also tried the non-GGUF Qwen3 1.7B model, but I encountered a "ValueError: … is not a model."

So does speculative decoding work with GGUF main models at all, and if so, can (or must) the draft model also be a GGUF model?

It would be nice to find some reliable information regarding this topic :slight_smile:

Speculative decoding in vLLM currently does not support GGUF models as either the main (target) or draft model. According to vLLM issue #1002, GGUF support is not available in upstream vLLM, and community efforts to add GGUF support are experimental and not merged. All official documentation and examples for speculative decoding use HuggingFace (transformers) format models, not GGUF. There is no mention in the official docs or codebase of speculative decoding with GGUF models, and errors like “ValueError: … is not a model” are expected if you try to use GGUF models in this context.

If you want to use speculative decoding, both the main and draft models must be in a supported format (e.g., HuggingFace transformers, not GGUF). Would you like more detail on how to set up speculative decoding with supported models?
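As a rough sketch, a draft-model setup with both models in HuggingFace transformers format might look like the following (the field names are taken from vLLM's SpeculativeConfig, but the model choices and values here are illustrative and not verified against any particular vLLM version; check vllm/config.py for the exact fields):

```shell
# Target model in HF format, draft model passed via --speculative-config.
# "model" names the draft model; "num_speculative_tokens" is how many
# tokens the draft proposes per verification step.
vllm serve Qwen/Qwen3-32B \
  --speculative-config '{"model": "Qwen/Qwen3-1.7B", "num_speculative_tokens": 5}'
```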


Hey there, currently you can specify the method via --speculative-config.

For example, to use ngram with qwen3:

vllm serve Qwen/Qwen3-30B-A3B --max-model-len 16384 --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5}'

We are working on documentation for this, but you can check out this class vllm/vllm/config.py at 9b70e2b4c147ea650f9b943e6aecd977377fbbfd · vllm-project/vllm · GitHub

Essentially, the JSON string translates to fields within this class.
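To make the mapping concrete, here is a minimal sketch of how that JSON string turns into class fields. `SpeculativeConfigSketch` is a hypothetical stand-in with only a few of the fields; the real class in vllm/config.py has many more:

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of vLLM's SpeculativeConfig, for illustration only.
@dataclass
class SpeculativeConfigSketch:
    method: str
    num_speculative_tokens: int
    prompt_lookup_max: Optional[int] = None

# The value passed to --speculative-config is parsed as JSON,
# and each key becomes an attribute of the config object.
raw = '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5}'
cfg = SpeculativeConfigSketch(**json.loads(raw))
print(cfg.method, cfg.num_speculative_tokens)  # → ngram 5
```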

Yes, you can specify the speculative decoding method (like ngram) using the --speculative-config flag, passing a JSON string with parameters such as method, num_speculative_tokens, and prompt_lookup_max. This configuration is parsed into the SpeculativeConfig class, which defines all the available options for speculative decoding in vLLM. For example, your command is correct for using ngram speculation with Qwen3 models.

You can find the full list of configurable parameters and their descriptions in the SpeculativeConfig class in vllm/config.py. Would you like a breakdown of all available options or example configurations for other methods?
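If it helps intuition, ngram ("prompt lookup") speculation needs no draft model at all: it searches the existing context for an earlier occurrence of the most recent n-gram and proposes the tokens that followed it, which the target model then verifies. A toy sketch (this is the general idea, not vLLM's actual implementation):

```python
def ngram_propose(tokens, num_speculative_tokens=5,
                  prompt_lookup_max=5, prompt_lookup_min=1):
    """Toy prompt-lookup speculation: match the trailing n-gram against
    earlier context and propose the tokens that followed the match."""
    # Prefer longer n-gram matches, down to the minimum length.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if len(tokens) < n:
            continue
        suffix = tokens[-n:]
        # Scan earlier positions (most recent first) for the same n-gram.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                return tokens[i + n:i + n + num_speculative_tokens]
    return []  # no match: nothing to speculate

# The context ends in [1, 2, 3], which also appeared at the start,
# so the tokens that followed it there are proposed as candidates.
print(ngram_propose([1, 2, 3, 4, 1, 2, 3]))  # → [4, 1, 2, 3]
```

All proposed tokens are still checked by the target model, so a bad match only costs wasted compute, never wrong output.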
