Cannot handle cases where distributed draft workers generate no tokens

Hello, vLLM team! I hit an error when using a draft model with tp=2. The command is:

MODEL='/data/framework_vllm/models/Qwen3-32B'
SPEC_MODEL='/data/framework_vllm/models/Qwen3-1.7B'
NUM_SPEC_TOKENS=3
SPEC_CFG="{\"model\": \"$SPEC_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}"

DEVICE_IDS=0,1
N_GPUS=2

CUDA_VISIBLE_DEVICES=${DEVICE_IDS} \
vllm serve $MODEL \
    --swap-space 16 \
    --disable-log-requests \
    --speculative_config "$SPEC_CFG" \
    --tensor_parallel_size ${N_GPUS}

Most of the time, it serves normally. But sometimes it reports an error:

ERROR 07-24 02:12:54 [engine.py:165]     return self._run_speculative_decoding_step(execute_model_req,
ERROR 07-24 02:12:54 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 02:12:54 [engine.py:165]   File "/root/anaconda3/envs/vllm_sxh/lib/python3.12/contextlib.py", line 81, in inner
ERROR 07-24 02:12:54 [engine.py:165]     return func(*args, **kwds)
ERROR 07-24 02:12:54 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^
ERROR 07-24 02:12:54 [engine.py:165]   File "/root/anaconda3/envs/vllm_sxh/lib/python3.12/site-packages/vllm/spec_decode/spec_decode_worker.py", line 790, in _run_speculative_decoding_step
ERROR 07-24 02:12:54 [engine.py:165]     raise RuntimeError("Cannot handle cases where distributed draft "
ERROR 07-24 02:12:54 [engine.py:165] RuntimeError: Cannot handle cases where distributed draft workers generate no tokens

What should I do if I want to keep the draft model at tp=2?

Speculative decoding in vLLM currently only supports draft models with tensor parallelism (TP) of 1, not TP > 1. The error you encountered, "Cannot handle cases where distributed draft workers generate no tokens", is a known limitation of running the draft model with TP > 1. To use speculative decoding reliably, set the draft model's tensor parallel size to 1, even if the target model uses TP > 1. This limitation is noted in the official Speculative Decoding docs and in several related GitHub issues.

If you need to use a larger draft model that doesn't fit on a single GPU, consider using a smaller or quantized draft model, or wait for future vLLM releases that may add support for TP > 1 in draft models. A minimal configuration example is sketched below.
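For example, here is a minimal sketch based on your original command. It assumes your vLLM version accepts a draft_tensor_parallel_size field in the --speculative_config JSON (the setting introduced around PR #5414); the target model keeps TP=2 while the draft model is pinned to TP=1:

MODEL='/data/framework_vllm/models/Qwen3-32B'
SPEC_MODEL='/data/framework_vllm/models/Qwen3-1.7B'
NUM_SPEC_TOKENS=3
# Pin the draft model to a single GPU; the target model still runs with TP=2.
SPEC_CFG="{\"model\": \"$SPEC_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS, \"draft_tensor_parallel_size\": 1}"

CUDA_VISIBLE_DEVICES=0,1 \
vllm serve $MODEL \
    --swap-space 16 \
    --disable-log-requests \
    --speculative_config "$SPEC_CFG" \
    --tensor_parallel_size 2

With this layout, only the draft forward pass is restricted to one GPU; verification on the target model still uses both GPUs.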


OK, thanks! But the official docs (https://docs.vllm.ai/en/stable/features/spec_decode.html) only say that PP is not supported right now:

Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.

Besides, is the default parallel policy that the draft model uses the same TP size as the target model? See https://github.com/vllm-project/vllm/pull/5414. This PR also confuses me.

You're correct that the official docs explicitly state speculative decoding is not compatible with pipeline parallelism (PP), but the situation with tensor parallelism (TP) is more nuanced. By default, vLLM sets the draft model's TP to match the target model's TP unless you explicitly set draft_tensor_parallel_size. However, as of now, only draft_tensor_parallel_size=1 is fully supported and stable; using TP > 1 for the draft model can lead to errors like the one you encountered, and this is acknowledged in PR #5414 and related discussions.

The PR you referenced (#5414) added infrastructure for supporting different TP sizes between draft and target models, but it was merged with TP > 1 for the draft model disabled by default due to unresolved issues. So, while the config allows you to set TP > 1, in practice only TP=1 for the draft model is recommended and supported. A short comparison of the default and the pinned configuration is sketched below.
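To make the default-versus-explicit behavior concrete, here is a short comparison of the two --speculative_config variants (again assuming your version exposes the draft_tensor_parallel_size field; the variable names are just illustrative):

# Default policy: no draft_tensor_parallel_size given, so the draft inherits the
# target's TP (2 in your setup). This is the configuration that can hit the
# "distributed draft workers generate no tokens" error.
SPEC_CFG_DEFAULT="{\"model\": \"$SPEC_MODEL\", \"num_speculative_tokens\": 3}"

# Supported path: explicitly pin the draft to TP=1 while the target stays at TP=2
# via --tensor_parallel_size 2.
SPEC_CFG_PINNED="{\"model\": \"$SPEC_MODEL\", \"num_speculative_tokens\": 3, \"draft_tensor_parallel_size\": 1}"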


Thanks very much! I know how to configure it now. Thanks!