The error "NotImplementedError: Speculative decoding with draft model is not supported yet. Please consider using other speculative decoding methods such as ngram, medusa, eagle, or mtp." means that vLLM is interpreting your config as a "draft model" setup, which your vLLM version does not yet support for speculative decoding. For GLM-4.7-Flash, use the mtp method, and pass the config as a single JSON string to --speculative-config rather than as separate --speculative-config.method and --speculative-config.num_speculative_tokens flags.
How to fix:
Replace
--speculative-config.method mtp --speculative-config.num_speculative_tokens 1
with
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
in your docker run command. Passing the whole config as one JSON string is the form the vLLM docs show for recent versions (see docs).
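Shell quoting is the usual place where a JSON argument like this goes wrong, so it can help to check the string before handing it to vLLM. The following is a minimal sketch, assuming python3 is available on the host; the variable name SPEC_CONFIG is only illustrative and is not something vLLM reads.

```shell
# Hypothetical variable name; the JSON value is the one from the fix above.
SPEC_CONFIG='{"method":"mtp","num_speculative_tokens":1}'

# python3 -m json.tool exits non-zero on invalid JSON,
# so this catches quoting mistakes before the container starts.
if printf '%s' "$SPEC_CONFIG" | python3 -m json.tool > /dev/null; then
    echo "valid JSON: $SPEC_CONFIG"
else
    echo "invalid JSON, fix the quoting before starting the container" >&2
fi
```

If the check passes, the value can be passed as --speculative-config "$SPEC_CONFIG", keeping the double quotes around the variable so the shell does not split or strip anything.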
Example:
docker run -d --gpus all --ipc=host --network host -p 8000:8000 \
-v /home/ubuntu/vllm_models:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="hf_UhSesM" \
--shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 \
--restart unless-stopped --name vllm_GLM-4.7-Flash vllm-openai-updated \
vllm serve zai-org/GLM-4.7-Flash \
--served-model-name gemma3 \
--dtype auto \
--max-model-len 4096 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--enable-chunked-prefill \
--swap-space 12 \
--tensor-parallel-size 2 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--trust-remote-code \
--enable-auto-tool-choice \
--max-num-batched-tokens 3000
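After restarting the container, one way to confirm the fix took effect is to search the container logs for the error message quoted at the top of this answer. This is a rough sketch only: the helper name has_draft_model_error is made up for illustration, and the container name is the one from the docker run command above.

```shell
# Hypothetical helper: returns 0 if the log text on stdin still contains
# the draft-model error quoted in this answer, and 1 otherwise.
has_draft_model_error() {
    grep -q "Speculative decoding with draft model is not supported yet"
}

# Usage (needs a running Docker daemon and the container from above):
#   docker logs vllm_GLM-4.7-Flash 2>&1 | has_draft_model_error \
#       && echo "still failing" || echo "error not found in logs"
```

The absence of the error only shows that this particular failure is gone; it does not prove the server finished loading the model.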
Let me know if you want a full working Dockerfile or further troubleshooting steps.
Sources:
- zai-org/GLM-4.7-Flash · Hugging Face
- Speculative Decoding Example (vLLM)
- GitHub Issue: NotImplementedError: Speculative decoding with draft model is not supported yet
- see docs