vLLM ROCm Gemma 4 and MTP

AnttiK · June 12, 2026, 9:02pm

How do I run Gemma 4 with MTP on ROCm?
I always get this error: vllm serve: error: argument --speculative-config/-sc: Value {method:mtp,num_speculative_tokens:1} cannot be converted to <function loads at 0x7d0da081b560>

It happens no matter which Docker image I use: nightly, latest, or the Gemma 4 specific one.
The option in my serve command is:
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:1}’"

RunLLM · June 12, 2026, 9:03pm

The error is caused by incorrect quoting or formatting of the --speculative-config argument. vLLM expects a valid JSON string, so you must use straight double quotes (") and not curly or smart quotes (“ ”), and the argument should be passed as a single string. For example:

--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

This applies to all vLLM Docker images, including ROCm. According to the vLLM documentation, and as shown in the argument parser source, the argument is parsed with json.loads, so invalid JSON or wrong quotes will cause this error. Make sure you use single quotes to wrap the entire JSON string and double quotes inside, and avoid any non-ASCII or smart quotes.
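
To see why the quote style matters, here is a minimal sketch that assumes, as stated above, that the option value is handed to Python's json.loads. The helper name try_parse is made up for illustration and is not part of vLLM. It parses three variants: straight quotes, curly quotes, and the unquoted value that appears in the error message.

```python
import json


def try_parse(raw: str):
    """Return the parsed object, or None if raw is not valid JSON (hypothetical helper)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Straight ASCII double quotes: valid JSON.
straight = '{"method":"mtp","num_speculative_tokens":1}'
# Curly quotes (U+201C / U+201D): not valid JSON string delimiters.
curly = '{\u201cmethod\u201d:\u201cmtp\u201d,\u201cnum_speculative_tokens\u201d:1}'
# The value exactly as printed in the error message: no quotes at all.
unquoted = '{method:mtp,num_speculative_tokens:1}'

ok = try_parse(straight)
bad_curly = try_parse(curly)
bad_unquoted = try_parse(unquoted)

print(ok)
print(bad_curly, bad_unquoted)
```

Only the first variant parses. The value in the error message has no quotes at all, which suggests the quotes were removed before vLLM received the argument.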

Would you like more detail or troubleshooting steps?

Hint: Mention @RunLLM in the post for followups.

AnttiK · June 12, 2026, 9:05pm

That was only a copy-paste problem; my serve command has it right.
But is the model wrong? QuantTrio/gemma-4-31B-it-AWQ-6Bit

RunLLM · June 12, 2026, 9:05pm

QuantTrio/gemma-4-31B-it-AWQ-6Bit does not support MTP (Multi-Token Prediction) speculative decoding in vLLM. MTP is only supported for Gemma 4 “assistant” checkpoints (e.g., google/gemma-4-31B-it-assistant), not for AWQ-quantized or standard IT models. Attempting to use MTP with this model will result in errors or unsupported behavior, as documented in the MTP guide.

Would you like more detail on which Gemma 4 models support MTP or how to select a compatible model?

Sources:

MTP (Multi-Token Prediction) - vLLM

AnttiK · June 12, 2026, 9:07pm

@RunLLM yes, list all 7900 XTX compatible Gemma 4 31B versions that support MTP

RunLLM · June 12, 2026, 9:07pm

Only the official Gemma 4 31B IT assistant checkpoint (google/gemma-4-31B-it-assistant) supports MTP speculative decoding on ROCm GPUs like the 7900 XTX. AWQ, GPTQ, or other quantized versions do not support MTP; you must use the assistant model as the drafter and the standard IT model as the target (google/gemma-4-31B-it).

Reference: The Gemma 4 MTP documentation and MTP guide specify that only the official assistant checkpoints are supported for MTP, and there is no mention of AWQ or other quantized variants being compatible.

Would you like a sample command for ROCm/7900 XTX?

AnttiK · June 12, 2026, 9:08pm

Yes, give me a sample command.

RunLLM · June 12, 2026, 9:09pm

Here is a sample command to run Gemma 4 31B with MTP on ROCm (e.g., 7900 XTX), using the official assistant checkpoint:

docker run --rm --ipc=host --network=host --shm-size 16G --device=/dev/kfd --device=/dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:latest \
  --model google/gemma-4-31B-it \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'

This uses the standard IT model as the target and the assistant as the drafter, as required for MTP. See Gemma 4 MTP documentation for more details.
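
One way to avoid hand-typed quoting mistakes is to generate the argument programmatically. The following is a minimal sketch, assuming the same target and drafter model names as the sample command above. The variable names are illustrative only. It serializes the config with json.dumps and lets shlex.join add the shell quoting.

```python
import json
import shlex

# Speculative decoding settings, taken from the sample command above.
spec = {
    "method": "mtp",
    "model": "google/gemma-4-31B-it-assistant",
    "num_speculative_tokens": 4,
}

# json.dumps always emits straight ASCII double quotes.
spec_json = json.dumps(spec, separators=(",", ":"))

# Only the vLLM arguments are built here; the docker flags are left out.
args = ["--model", "google/gemma-4-31B-it", "--speculative-config", spec_json]

# shlex.join wraps any argument containing shell-special characters in single quotes.
command_line = shlex.join(args)
print(command_line)
```

Pasting the printed line into a POSIX shell should hand the JSON to the program unchanged, because shlex.split(command_line) round-trips to the original argument list.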

Sources:

Gemma 4 MTP documentation

AnttiK · June 12, 2026, 9:15pm

What is wrong with this command? I still get the same error:
docker run -it
–dns=192.168.49.1
–network=host
–group-add=video
–ipc=host
–cap-add=SYS_PTRACE
–security-opt seccomp=unconfined
–device /dev/kfd
–device /dev/dri
–shm-size=10g
-e HUGGING_FACE_HUB_TOKEN=“hf”
-e VLLM_SLEEP_WHEN_IDLE=1
-e ROCM_VISIBLE_DEVICES=0,1
-e HIP_VISIBLE_DEVICES=0,1
-e HSA_OVERRIDE_GFX_VERSION=11.0.0
-e PYTORCH_ROCM_ARCH=“gfx1100”
-e VLLM_USE_TRITON_FLASH_ATTN=0
-e GPU_MAX_HW_QUEUES=1
-e NCCL_DEBUG=WARN
-e NCCL_IB_DISABLE=1
–restart=unless-stopped
–name vllm_rocm_emma-4-31B-it-assistant_mtp
–entrypoint /bin/bash
-v /home/ubuntu/vllm_models:/root/.cache/huggingface
-v /home/ubuntu/docker_shared:/app/docker_shared
vllm/vllm-openai-rocm:nightly
-c “sudo apt update && sudo apt-get install -y nano && pip install pandas tqdm scikit-learn &&
vllm serve google/gemma-4-31B-it
–host 0.0.0.0
–port 8000
–served-model-name gemma3
–trust-remote-code
–tensor-parallel-size 2
–attention-backend TRITON_ATTN
–dtype float16
–kv-cache-dtype auto
–max-model-len 3500
–max-num-seqs 15
–max-num-batched-tokens 10000
–gpu-memory-utilization 0.88
–speculative-config ‘{“method”:“mtp”,“model”:“google/gemma-4-31B-it-assistant”,“num_speculative_tokens”:4}’”

RunLLM · June 12, 2026, 9:15pm

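Your command uses smart (curly) quotes (“ ” and ‘ ’) instead of standard ASCII double (") and single (') quotes. vLLM requires valid JSON for --speculative-config, so you must use only straight quotes. Also, ensure all docker flags use double hyphens (--), not en-dashes (–). Finally, because the whole vllm serve command sits inside a double-quoted -c "..." string, the double quotes inside the JSON must be escaped as \". Otherwise the outer shell consumes them and vLLM receives {method:mtp,...} without any quotes, which matches the value shown in the error message.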

Example fix (key parts shown):

docker run -it --dns=192.168.49.1 --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri --shm-size=10g \
-e HUGGING_FACE_HUB_TOKEN="hf" ... \
vllm/vllm-openai-rocm:nightly -c "sudo apt update && sudo apt-get install -y nano && pip install pandas tqdm scikit-learn && vllm serve google/gemma-4-31B-it --host 0.0.0.0 --port 8000 --served-model-name gemma3 --trust-remote-code --tensor-parallel-size 2 --attention-backend TRITON_ATTN --dtype float16 --kv-cache-dtype auto --max-model-len 3500 --max-num-seqs 15 --max-num-batched-tokens 10000 --gpu-memory-utilization 0.88 --speculative-config '{\"method\":\"mtp\",\"model\":\"google/gemma-4-31B-it-assistant\",\"num_speculative_tokens\":4}'"

See Gemma 4 MTP documentation for correct syntax.
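
The effect of the nested quoting can be reproduced without Docker. The sketch below uses Python's shlex.split as a stand-in for POSIX shell word splitting (an approximation: it does not perform variable or brace expansion). The function name json_seen_by_vllm is hypothetical, and the command is shortened to the relevant parts.

```python
import json
import shlex

# Inner double quotes are NOT escaped, so they end and reopen the outer "..." string.
broken = r'''bash -c "vllm serve google/gemma-4-31B-it --speculative-config '{"method":"mtp","num_speculative_tokens":1}'"'''

# Inner double quotes escaped with backslashes, as in the example fix above.
fixed = r'''bash -c "vllm serve google/gemma-4-31B-it --speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":1}'"'''


def json_seen_by_vllm(outer_command: str) -> str:
    """Return the last argument vllm serve would receive (hypothetical helper)."""
    # First pass: the outer shell splits the line; the last word is the bash -c script.
    script = shlex.split(outer_command)[-1]
    # Second pass: bash splits the script; the last word is the --speculative-config value.
    return shlex.split(script)[-1]


broken_value = json_seen_by_vllm(broken)
fixed_value = json_seen_by_vllm(fixed)

print(broken_value)
print(fixed_value)
print(json.loads(fixed_value))
```

The unescaped variant yields {method:mtp,num_speculative_tokens:1}, the same string reported in the original error, while the escaped variant survives both parsing passes as valid JSON.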
