vLLM ROCm Gemma 4 and MTP

How do I run Gemma 4 with MTP on ROCm?
I always get: vllm serve: error: argument --speculative-config/-sc: Value {method:mtp,num_speculative_tokens:1} cannot be converted to <function loads at 0x7d0da081b560>

It happens no matter which Docker image I use: nightly, latest, or the Gemma 4-specific one.
The option in my serve command is:
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:1}’"

The error is caused by incorrect quoting or formatting of the --speculative-config argument. vLLM expects a valid JSON string, so you must use straight double quotes (") and not curly or smart quotes (“ ”), and the argument should be passed as a single string. For example:

--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

This applies to all vLLM Docker images, including ROCm. According to the vLLM documentation, and as shown in the argument parser source, the argument is parsed with json.loads, so invalid JSON or wrong quotes will cause this error. Make sure you use single quotes to wrap the entire JSON string and double quotes inside, and avoid any non-ASCII or smart quotes.
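To see why the quote characters matter, here is a minimal Python sketch. It assumes only what the error message suggests (that the value is handed to json.loads); the helper names parses_as_json and non_ascii_chars are made up for illustration and are not part of vLLM.

```python
import json

# Same config twice: once with straight ASCII quotes, once with curly quotes
# as they often appear after copying from a web page or word processor.
straight = '{"method":"mtp","num_speculative_tokens":1}'
curly = '{“method”:“mtp”,“num_speculative_tokens”:1}'


def parses_as_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})


print(parses_as_json(straight))   # straight quotes are valid JSON
print(parses_as_json(curly))      # curly quotes are rejected
print(non_ascii_chars(curly))     # shows which characters to replace
```

Running a check like non_ascii_chars on a pasted command is a quick way to spot smart quotes or en-dashes before passing it to the shell.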

Would you like more detail or troubleshooting steps?

Hint: Mention @RunLLM in the post for followups.

That was only a copy-paste problem; my serve command has it right.
But is the model wrong? QuantTrio/gemma-4-31B-it-AWQ-6Bit

QuantTrio/gemma-4-31B-it-AWQ-6Bit does not support MTP (Multi-Token Prediction) speculative decoding in vLLM. MTP is only supported for Gemma 4 “assistant” checkpoints (e.g., google/gemma-4-31B-it-assistant), not for AWQ-quantized or standard IT models. Attempting to use MTP with this model will result in errors or unsupported behavior, as documented in the MTP guide.

Would you like more detail on which Gemma 4 models support MTP or how to select a compatible model?

@RunLLM yes list all 7900 XTX compatible gemma4 31B versions which support mtp

Only the official Gemma 4 31B IT assistant checkpoint (google/gemma-4-31B-it-assistant) supports MTP speculative decoding on ROCm GPUs like the 7900 XTX. AWQ, GPTQ, or other quantized versions do not support MTP; you must use the assistant model as the drafter and the standard IT model as the target (google/gemma-4-31B-it).

Reference: The Gemma 4 MTP documentation and MTP guide specify that only the official assistant checkpoints are supported for MTP, and there is no mention of AWQ or other quantized variants being compatible.

Would you like a sample command for ROCm/7900 XTX?

yes give me sample command

Here is a sample command to run Gemma 4 31B with MTP on ROCm (e.g., 7900 XTX), using the official assistant checkpoint:

docker run --rm --ipc=host --network=host --shm-size 16G --device=/dev/kfd --device=/dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:latest \
  --model google/gemma-4-31B-it \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'

This uses the standard IT model as the target and the assistant as the drafter, as required for MTP. See Gemma 4 MTP documentation for more details.
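One way to avoid hand-typed quoting mistakes is to generate the argument instead of writing it by hand. The following is a rough Python sketch, not an official vLLM recipe: it serializes the config with json.dumps and lets shlex.quote add the shell quoting. The model names and the --speculative-config flag are taken from the command above; the variable names are illustrative.

```python
import json
import shlex

# Config values copied from the sample command above.
speculative_config = {
    "method": "mtp",
    "model": "google/gemma-4-31B-it-assistant",
    "num_speculative_tokens": 4,
}

# Compact JSON with straight ASCII double quotes only.
config_json = json.dumps(speculative_config, separators=(",", ":"))

# Quote each word for a POSIX shell; the JSON ends up wrapped in single quotes.
serve_words = ["vllm", "serve", "google/gemma-4-31B-it",
               "--speculative-config", config_json]
serve_command = " ".join(shlex.quote(word) for word in serve_words)

# If the command has to be nested inside bash -c, quote the whole string again.
wrapped_command = "bash -c " + shlex.quote(serve_command)

print(serve_command)
print(wrapped_command)
```

The printed lines can be pasted into a terminal or script. The nested form looks noisy, but it survives both rounds of shell parsing.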

What is wrong with this command? I still get the same error:
docker run -it
–dns=192.168.49.1
–network=host
–group-add=video
–ipc=host
–cap-add=SYS_PTRACE
–security-opt seccomp=unconfined
–device /dev/kfd
–device /dev/dri
–shm-size=10g
-e HUGGING_FACE_HUB_TOKEN=“hf”
-e VLLM_SLEEP_WHEN_IDLE=1
-e ROCM_VISIBLE_DEVICES=0,1
-e HIP_VISIBLE_DEVICES=0,1
-e HSA_OVERRIDE_GFX_VERSION=11.0.0
-e PYTORCH_ROCM_ARCH=“gfx1100”
-e VLLM_USE_TRITON_FLASH_ATTN=0
-e GPU_MAX_HW_QUEUES=1
-e NCCL_DEBUG=WARN
-e NCCL_IB_DISABLE=1
–restart=unless-stopped
–name vllm_rocm_emma-4-31B-it-assistant_mtp
–entrypoint /bin/bash
-v /home/ubuntu/vllm_models:/root/.cache/huggingface
-v /home/ubuntu/docker_shared:/app/docker_shared
vllm/vllm-openai-rocm:nightly
-c “sudo apt update && sudo apt-get install -y nano && pip install pandas tqdm scikit-learn &&
vllm serve google/gemma-4-31B-it
–host 0.0.0.0
–port 8000
–served-model-name gemma3
–trust-remote-code
–tensor-parallel-size 2
–attention-backend TRITON_ATTN
–dtype float16
–kv-cache-dtype auto
–max-model-len 3500
–max-num-seqs 15
–max-num-batched-tokens 10000
–gpu-memory-utilization 0.88
–speculative-config ‘{“method”:“mtp”,“model”:“google/gemma-4-31B-it-assistant”,“num_speculative_tokens”:4}’”

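Your command uses curly (smart) quotes (“ ” and ‘ ’) instead of plain ASCII double (") and single (') quotes. vLLM requires valid JSON for --speculative-config, so use only straight quotes. Also make sure every long flag starts with two ASCII hyphens (--), not an en-dash (–). Finally, because the whole serve command sits inside a double-quoted -c "..." string, the double quotes inside the JSON must be escaped as \"; otherwise the shell strips them and vLLM receives a quote-less value like {method:mtp,num_speculative_tokens:1}, which is exactly what your first error message shows.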

Example fix (key parts shown):

docker run -it \
  --dns=192.168.49.1 \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  --shm-size=10g \
  -e HUGGING_FACE_HUB_TOKEN="hf" ... \
  vllm/vllm-openai-rocm:nightly \
  -c "sudo apt update && sudo apt-get install -y nano && pip install pandas tqdm scikit-learn && \
    vllm serve google/gemma-4-31B-it \
      --host 0.0.0.0 \
      --port 8000 \
      --served-model-name gemma3 \
      --trust-remote-code \
      --tensor-parallel-size 2 \
      --attention-backend TRITON_ATTN \
      --dtype float16 \
      --kv-cache-dtype auto \
      --max-model-len 3500 \
      --max-num-seqs 15 \
      --max-num-batched-tokens 10000 \
      --gpu-memory-utilization 0.88 \
      --speculative-config '{\"method\":\"mtp\",\"model\":\"google/gemma-4-31B-it-assistant\",\"num_speculative_tokens\":4}'"

See Gemma 4 MTP documentation for correct syntax.
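The effect of nesting the JSON inside a double-quoted -c string can be reproduced without a GPU. The sketch below uses Python's shlex module as a stand-in for POSIX shell word splitting (an approximation, not bash itself) and a shortened, hypothetical command line. The helper name argument_seen_by_vllm is invented for this illustration.

```python
import json
import shlex


def argument_seen_by_vllm(command_line):
    """Approximate two rounds of shell parsing: the outer shell, then bash -c."""
    outer_words = shlex.split(command_line)   # words the outer shell would produce
    inner_script = outer_words[-1]            # the string handed to bash -c
    return shlex.split(inner_script)[-1]      # last word of the inner command


# JSON quotes left unescaped inside the double-quoted -c string.
unescaped = '''bash -c "vllm serve --speculative-config '{"method":"mtp","num_speculative_tokens":1}'"'''

# JSON quotes escaped with backslashes, as in the corrected command.
escaped = r'''bash -c "vllm serve --speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":1}'"'''

for label, command in (("unescaped", unescaped), ("escaped", escaped)):
    value = argument_seen_by_vllm(command)
    try:
        json.loads(value)
        verdict = "valid JSON"
    except json.JSONDecodeError:
        verdict = "not valid JSON"
    print(label, "->", value, "->", verdict)
```

With the unescaped form, the inner double quotes close and reopen the outer string, so they disappear before the value reaches the argument parser. The escaped form keeps them.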
