Cannot handle cases where distributed draft workers generate no tokens

Hello, vLLM team! I hit an error when using a draft model with tp=2. The command is:

MODEL='/data/framework_vllm/models/Qwen3-32B'
SPEC_MODEL='/data/framework_vllm/models/Qwen3-1.7B'
NUM_SPEC_TOKENS=3
SPEC_CFG="{\"model\": \"$SPEC_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}"

DEVICE_IDS=0,1
N_GPUS=2

CUDA_VISIBLE_DEVICES=${DEVICE_IDS} \
vllm serve $MODEL \
    --swap-space 16 \
    --disable-log-requests \
    --speculative_config "$SPEC_CFG" \
    --tensor_parallel_size ${N_GPUS}

Most of the time, it serves normally. But sometimes it reports an error:

ERROR 07-24 02:12:54 [engine.py:165]     return self._run_speculative_decoding_step(execute_model_req,
ERROR 07-24 02:12:54 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 02:12:54 [engine.py:165]   File "/root/anaconda3/envs/vllm_sxh/lib/python3.12/contextlib.py", line 81, in inner
ERROR 07-24 02:12:54 [engine.py:165]     return func(*args, **kwds)
ERROR 07-24 02:12:54 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^
ERROR 07-24 02:12:54 [engine.py:165]   File "/root/anaconda3/envs/vllm_sxh/lib/python3.12/site-packages/vllm/spec_decode/spec_decode_worker.py", line 790, in _run_speculative_decoding_step
ERROR 07-24 02:12:54 [engine.py:165]     raise RuntimeError("Cannot handle cases where distributed draft "
ERROR 07-24 02:12:54 [engine.py:165] RuntimeError: Cannot handle cases where distributed draft workers generate no tokens

What should I do if I want to keep the draft model at tp=2?

Speculative decoding in vLLM currently only supports draft models with tensor parallelism (TP) of 1, not TP > 1. The error you encountered, "Cannot handle cases where distributed draft workers generate no tokens", is a known limitation of running the draft model with TP > 1. To use speculative decoding reliably, set the draft model's tensor parallel size to 1, even if the target model uses TP > 1. This limitation is noted in the official Speculative Decoding docs and in several related GitHub issues.

If you need to use a larger draft model that doesn't fit on a single GPU, consider using a smaller or quantized draft model, or wait for future vLLM releases that may add support for TP > 1 in draft models. A minimal configuration example is sketched below.
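For example, here is a minimal sketch based on your original command. It assumes your vLLM version accepts a draft_tensor_parallel_size field in the --speculative_config JSON (the setting introduced around PR #5414); the target model keeps TP=2 while the draft model is pinned to TP=1:

MODEL='/data/framework_vllm/models/Qwen3-32B'
SPEC_MODEL='/data/framework_vllm/models/Qwen3-1.7B'
NUM_SPEC_TOKENS=3
# Pin the draft model to a single GPU; the target model still runs with TP=2.
SPEC_CFG="{\"model\": \"$SPEC_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS, \"draft_tensor_parallel_size\": 1}"

CUDA_VISIBLE_DEVICES=0,1 \
vllm serve $MODEL \
    --swap-space 16 \
    --disable-log-requests \
    --speculative_config "$SPEC_CFG" \
    --tensor_parallel_size 2

With this layout, only the draft forward pass is restricted to one GPU; verification on the target model still uses both GPUs.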


OK, thanks! But the official docs (https://docs.vllm.ai/en/stable/features/spec_decode.html) only say that PP is not supported right now:

Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.

Besides, is the default parallel policy that the draft model uses the same TP size as the target model? See https://github.com/vllm-project/vllm/pull/5414. This PR also confuses me.

You're correct that the official docs explicitly state speculative decoding is not compatible with pipeline parallelism (PP), but the situation with tensor parallelism (TP) is more nuanced. By default, vLLM sets the draft model's TP to match the target model's TP unless you explicitly set draft_tensor_parallel_size. However, as of now, only draft_tensor_parallel_size=1 is fully supported and stable; using TP > 1 for the draft model can lead to errors like the one you encountered, and this is acknowledged in PR #5414 and related discussions.

The PR you referenced (#5414) added infrastructure for supporting different TP sizes between draft and target models, but it was merged with TP > 1 for the draft model disabled by default due to unresolved issues. So, while the config allows you to set TP > 1, in practice only TP=1 for the draft model is recommended and supported. A short comparison of the default and the pinned configuration is sketched below.
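To make the default-versus-explicit behavior concrete, here is a short comparison of the two --speculative_config variants (again assuming your version exposes the draft_tensor_parallel_size field; the variable names are just illustrative):

# Default policy: no draft_tensor_parallel_size given, so the draft inherits the
# target's TP (2 in your setup). This is the configuration that can hit the
# "distributed draft workers generate no tokens" error.
SPEC_CFG_DEFAULT="{\"model\": \"$SPEC_MODEL\", \"num_speculative_tokens\": 3}"

# Supported path: explicitly pin the draft to TP=1 while the target stays at TP=2
# via --tensor_parallel_size 2.
SPEC_CFG_PINNED="{\"model\": \"$SPEC_MODEL\", \"num_speculative_tokens\": 3, \"draft_tensor_parallel_size\": 1}"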


Thanks very much! I know how to configure it now. Thanks!