Is there any plan to support deepseek_mtp when PP > 1? Or is there any workaround patch?
Yes, vLLM now supports DeepSeek MTP (multi-token prediction) with pipeline parallelism (PP > 1). Enable it by passing --speculative-config='{"method": "deepseek_mtp", "num_speculative_tokens": 1}' along with your desired --pipeline-parallel-size and --tensor-parallel-size when launching vLLM. There is no need to export or import the MTP module separately; the DeepSeek model weights already include the MTP layer. This feature has been tested and confirmed to work with PP > 1 on DeepSeek-R1 models, and it is compatible with both tensor and pipeline parallelism according to recent merged PRs and user confirmations in the vLLM repository. For example:
vllm serve deepseek-ai/DeepSeek-R1 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--speculative-config='{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
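A common source of launch failures is a malformed JSON string in --speculative-config (shell quoting can silently mangle it). As a quick sanity check before launching, you can parse the value yourself; this is a generic sketch using Python's standard library, not part of vLLM's own tooling:

```python
import json

# The value passed to --speculative-config must be valid JSON.
# Parse it first to catch quoting or syntax mistakes early.
spec_config = '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
parsed = json.loads(spec_config)

print(parsed["method"])                  # deepseek_mtp
print(parsed["num_speculative_tokens"])  # 1
```

If json.loads raises a JSONDecodeError, the string your shell is actually delivering to vLLM is not the JSON you intended.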
Would you like more detail on configuration or troubleshooting?
But these are for V1. Is there any plan to support deepseek_mtp for V0?
There is no plan to add or enhance DeepSeek MTP (multi-token prediction) support in vLLM V0. The vLLM team has frozen V0 development, allowing only minor bug fixes, and is focusing all new features, including DeepSeek MTP and pipeline parallelism, on V1. V0 will be deprecated and removed from the codebase soon, and all major speculative decoding and DeepSeek MTP improvements are being implemented in V1 only. No workaround or patch for V0 is mentioned or planned in the official roadmap or discussions.
Would you like more detail on migration to V1 or feature differences?