GLM-4.7-Flash with nvidia

RunLLM · January 22, 2026, 2:22am

Your error, "NotImplementedError: Speculative decoding with draft model is not supported yet. Please consider using other speculative decoding methods such as ngram, medusa, eagle, or mtp.", means that vLLM is interpreting your config as a "draft model" setup, which it does not support yet. For GLM-4.7-Flash, use the mtp method, and pass the config as a single JSON string to --speculative-config rather than as separate --speculative-config.method and --speculative-config.num_speculative_tokens flags.

How to fix:
Replace

--speculative-config.method mtp --speculative-config.num_speculative_tokens 1

with

--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

in your docker run command. In recent vLLM versions, this is the only supported way to pass the speculative decoding config (see the vLLM docs).
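The single quotes matter because the shell must hand the whole JSON object to vLLM as one argument. A minimal sketch of that idea, assuming a POSIX-style shell; SPEC_CONFIG and count_args are hypothetical names used only for illustration and are not part of vLLM or Docker:

```shell
# Hypothetical variable: the whole speculative config as one JSON string.
# Single quotes keep the braces and double quotes literal.
SPEC_CONFIG='{"method":"mtp","num_speculative_tokens":1}'

# Hypothetical helper: prints how many arguments it received.
count_args() {
  echo "$#"
}

# The flag plus its JSON value should arrive as exactly two arguments.
count_args --speculative-config "$SPEC_CONFIG"   # prints 2
```

In the real command you would pass --speculative-config "$SPEC_CONFIG" (double-quoted) or the single-quoted literal to vllm serve.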

Example:

docker run -d --gpus all --ipc=host --network host -p 8000:8000 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="<your_hf_token>" \
  --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 \
  --restart unless-stopped --name vllm_GLM-4.7-Flash vllm-openai-updated \
  vllm serve zai-org/GLM-4.7-Flash \
    --served-model-name gemma3 \
    --dtype auto \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --swap-space 12 \
    --tensor-parallel-size 2 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --max-num-batched-tokens 3000

Let me know if you want a full working Dockerfile or further troubleshooting steps.
