Speculative decoding in vLLM currently requires the draft model to run with tensor parallelism (TP) of 1; draft TP > 1 is not supported. The error you encountered, “Cannot handle cases where distributed draft workers generate no tokens,” is a known limitation of running the draft model with TP > 1. To use speculative decoding reliably, set the draft model’s tensor parallel size to 1, even when the main model uses TP > 1. This limitation is noted in the official Speculative Decoding docs and in several related GitHub issues.
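As a concrete starting point, here is a minimal sketch of that configuration for offline inference. The argument names (speculative_model, num_speculative_tokens, speculative_draft_tensor_parallel_size) match the vLLM releases from around PR #5414 and may differ in newer versions, some of which also require use_v2_block_manager=True; the model names are placeholders.

```python
from vllm import LLM, SamplingParams

# Target model sharded across 4 GPUs; draft model pinned to TP=1,
# which avoids the "distributed draft workers generate no tokens" error.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",             # placeholder target model
    tensor_parallel_size=4,                             # target TP > 1 is fine
    speculative_model="meta-llama/Llama-2-7b-chat-hf",  # placeholder draft model
    num_speculative_tokens=5,                           # draft proposals per step
    speculative_draft_tensor_parallel_size=1,           # the key setting: draft TP = 1
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```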
If the draft model you want does not fit on a single GPU, consider a smaller or quantized draft model instead, or wait for a future vLLM release that adds support for TP > 1 in draft models.
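One concrete workaround, under the same assumptions as the sketch above, is to swap in a quantized draft checkpoint so that TP = 1 still fits on one GPU. The AWQ model name below is a placeholder, and vLLM reads the quantization method from the checkpoint’s own config:

```python
from vllm import LLM

# Same pattern as before, but with an AWQ-quantized draft model
# (placeholder name) so the draft fits on a single GPU at TP=1.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,
    speculative_model="TheBloke/Llama-2-13B-chat-AWQ",  # quantized draft
    num_speculative_tokens=5,
    speculative_draft_tensor_parallel_size=1,
)
```

Would you like more detail on workarounds or configuration examples?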
You’re correct that the official docs explicitly state speculative decoding is not compatible with pipeline parallelism (PP), but the situation with tensor parallelism (TP) is more nuanced. By default, vLLM gives the draft model the same TP as the target model unless you explicitly set the draft TP (the speculative_draft_tensor_parallel_size engine argument) to 1. However, only a draft TP of 1 is currently fully supported and stable; running the draft model with TP > 1 can trigger errors like the one you encountered, as acknowledged in PR #5414 and related discussions.
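To make that default concrete, here is a sketch of the configuration that reproduces your error: with no explicit draft TP, the draft silently inherits the target’s TP (again using the PR #5414-era argument names and placeholder models):

```python
from vllm import LLM

# With no explicit draft TP, the draft inherits tensor_parallel_size=4
# and can fail at runtime with "Cannot handle cases where distributed
# draft workers generate no tokens".
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,
    speculative_model="meta-llama/Llama-2-7b-chat-hf",
    num_speculative_tokens=5,
    # speculative_draft_tensor_parallel_size is omitted here, so it
    # defaults to the target's TP; set it to 1 (as above) to stay on
    # the supported path.
)
```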
The PR you referenced (#5414) added the infrastructure for running the draft and target models at different TP sizes, but it was merged with draft TP > 1 disabled by default due to unresolved issues. So while the configuration surface lets you request TP > 1, in practice only TP = 1 for the draft model is recommended and supported. Would you like a step-by-step explanation of how to configure this?