How to use speculative decoding?

Hey there!

So I am interested in finding out how speculative decoding works in vLLM, specifically using draft models in conjunction with GGUF models.

I tried Qwen3 32B Q4k_m and added a speculative decoding config, in which I picked Qwen3 1.7B Q6k_m. I also tried the non-GGUF Qwen3 1.7B model, but I encountered a "ValueError: … is not a model."

So does speculative decoding work with GGUF main models at all, and if so, can (or must) the draft model also be a GGUF model?

It would be nice to find some reliable information regarding this topic :slight_smile:

Speculative decoding in vLLM currently does not support GGUF models as either the main (target) or draft model. According to vLLM issue #1002, GGUF support is not available in upstream vLLM, and community efforts to add GGUF support are experimental and not merged. All official documentation and examples for speculative decoding use HuggingFace (transformers) format models, not GGUF. There is no mention in the official docs or codebase of speculative decoding with GGUF models, and errors like “ValueError: … is not a model” are expected if you try to use GGUF models in this context.

If you want to use speculative decoding, both the main and draft models must be in a supported format (e.g., HuggingFace transformers, not GGUF). Would you like more detail on how to set up speculative decoding with supported models?
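As a rough sketch, a draft-model setup with both models in HuggingFace transformers format might look like the following (the field names are taken from vLLM's SpeculativeConfig, but the model choices and values here are illustrative and not verified against any particular vLLM version; check vllm/config.py for the exact fields):

```shell
# Target model in HF format, draft model passed via --speculative-config.
# "model" names the draft model; "num_speculative_tokens" is how many
# tokens the draft proposes per verification step.
vllm serve Qwen/Qwen3-32B \
  --speculative-config '{"model": "Qwen/Qwen3-1.7B", "num_speculative_tokens": 5}'
```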


Hey there, currently you can specify the method via --speculative-config.

For example, to use ngram with qwen3:

vllm serve Qwen/Qwen3-30B-A3B --max-model-len 16384 --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5}'

We are working on documentation for this, but you can check out this class vllm/vllm/config.py at 9b70e2b4c147ea650f9b943e6aecd977377fbbfd · vllm-project/vllm · GitHub

Essentially, the JSON string translates to fields within this class.
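To make the mapping concrete, here is a minimal sketch of how that JSON string turns into class fields. `SpeculativeConfigSketch` is a hypothetical stand-in with only a few of the fields; the real class in vllm/config.py has many more:

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of vLLM's SpeculativeConfig, for illustration only.
@dataclass
class SpeculativeConfigSketch:
    method: str
    num_speculative_tokens: int
    prompt_lookup_max: Optional[int] = None

# The value passed to --speculative-config is parsed as JSON,
# and each key becomes an attribute of the config object.
raw = '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5}'
cfg = SpeculativeConfigSketch(**json.loads(raw))
print(cfg.method, cfg.num_speculative_tokens)  # → ngram 5
```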

Yes, you can specify the speculative decoding method (like ngram) using the --speculative-config flag, passing a JSON string with parameters such as method, num_speculative_tokens, and prompt_lookup_max. This configuration is parsed into the SpeculativeConfig class, which defines all the available options for speculative decoding in vLLM. For example, your command is correct for using ngram speculation with Qwen3 models.

You can find the full list of configurable parameters and their descriptions in the SpeculativeConfig class in vllm/config.py. Would you like a breakdown of all available options or example configurations for other methods?
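If it helps intuition, ngram ("prompt lookup") speculation needs no draft model at all: it searches the existing context for an earlier occurrence of the most recent n-gram and proposes the tokens that followed it, which the target model then verifies. A toy sketch (this is the general idea, not vLLM's actual implementation):

```python
def ngram_propose(tokens, num_speculative_tokens=5,
                  prompt_lookup_max=5, prompt_lookup_min=1):
    """Toy prompt-lookup speculation: match the trailing n-gram against
    earlier context and propose the tokens that followed the match."""
    # Prefer longer n-gram matches, down to the minimum length.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if len(tokens) < n:
            continue
        suffix = tokens[-n:]
        # Scan earlier positions (most recent first) for the same n-gram.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                return tokens[i + n:i + n + num_speculative_tokens]
    return []  # no match: nothing to speculate

# The context ends in [1, 2, 3], which also appeared at the start,
# so the tokens that followed it there are proposed as candidates.
print(ngram_propose([1, 2, 3, 4, 1, 2, 3]))  # → [4, 1, 2, 3]
```

All proposed tokens are still checked by the target model, so a bad match only costs wasted compute, never wrong output.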
