Is there any plan to support deepseek_mtp when PP > 1? Or is there any workaround patch?
Yes, vLLM now supports DeepSeek MTP (multi-token prediction) with pipeline parallelism (PP > 1). Enable it by passing --speculative-config='{"method": "deepseek_mtp", "num_speculative_tokens": 1}' along with your desired --pipeline-parallel-size and --tensor-parallel-size when launching vLLM. There is no need to export or import the MTP module separately; the DeepSeek model weights already include the MTP layer. This feature has been tested and confirmed to work with PP > 1 on DeepSeek-R1 models, and it is compatible with both tensor and pipeline parallelism according to recent merged PRs and user confirmations in the vLLM repository. For example:
vllm serve deepseek-ai/DeepSeek-R1 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--speculative-config='{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
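A common source of launch failures is a malformed JSON string in --speculative-config (shell quoting can silently mangle it). As a quick sanity check before launching, you can parse the value yourself; this is a generic sketch using Python's standard library, not part of vLLM's own tooling:

```python
import json

# The value passed to --speculative-config must be valid JSON.
# Parse it first to catch quoting or syntax mistakes early.
spec_config = '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
parsed = json.loads(spec_config)

print(parsed["method"])                  # deepseek_mtp
print(parsed["num_speculative_tokens"])  # 1
```

If json.loads raises a JSONDecodeError, the string your shell is actually delivering to vLLM is not the JSON you intended.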
Would you like more detail on configuration or troubleshooting?
But these are for V1. Is there any plan to support deepseek_mtp for V0?
There is no plan to add or enhance DeepSeek MTP (multi-token prediction) support in vLLM V0. The vLLM team has frozen V0 development, allowing only minor bug fixes, and is focusing all new features, including DeepSeek MTP and pipeline parallelism, on V1. V0 will be deprecated and removed from the codebase soon, and all major speculative decoding and DeepSeek MTP improvements are being implemented in V1 only. No workaround or patch for V0 is mentioned or planned in the official roadmap or discussions.
Would you like more detail on migration to V1 or feature differences?