Hi vLLM team,
I want to run GLM-5.1 with speculative decoding on 2 nodes, using TP + PP.
My idea is:
- Run the main/target model with TP + PP across the 2 nodes.
- Keep the MTP draft model only on the last PP stage.
- Let the draft model use PP=1 and the same TP size as the target model.
- Map the draft ranks using the TP-local rank, since the draft is local to the last PP stage.
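To make the rank mapping concrete, here is a standalone sketch of what I mean (plain Python, not vLLM code; it assumes a PP-major rank layout where `global_rank = pp_rank * tp_size + tp_rank`):

```python
# Hypothetical sketch of the proposed draft-rank mapping (not vLLM code).
# Assumes a PP-major layout: global_rank = pp_rank * tp_size + tp_rank.

def last_stage_draft_ranks(tp_size: int, pp_size: int) -> dict[int, int]:
    """Map each global rank on the last PP stage to its TP-local rank.

    The PP=1 draft model would reuse that TP-local rank as its own TP rank,
    so the draft only ever runs on the last pipeline stage.
    """
    first = (pp_size - 1) * tp_size  # first global rank of the last stage
    return {first + tp_rank: tp_rank for tp_rank in range(tp_size)}

# Example: TP=4, PP=2 across 2 nodes -> the draft lives on global ranks 4..7,
# which map to draft TP ranks 0..3.
print(last_stage_draft_ranks(tp_size=4, pp_size=2))
# {4: 0, 5: 1, 6: 2, 7: 3}
```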
Does this make sense for vLLM’s current V1 speculative decoding runner? Is this a reasonable way to support PP with MTP draft models?
Thanks!
Currently, vLLM does not support running the MTP draft model only on the last PP stage with PP=1 while the target model uses TP+PP across nodes. The speculative decoding framework expects the draft model's distribution to match the global parallel configuration, and there is no built-in mechanism to restrict the draft model to the last PP stage or to remap draft ranks in this way. As of now, speculative decoding with pipeline parallelism (PP) is only supported when the draft and target models use the same PP configuration, and most implementations require the draft to run with PP=1 and TP=1, with ongoing work to expand support (source, source, source, source).
Your proposed approach (running the draft model only on the last PP stage with PP=1 and mapping ranks locally) is not currently supported by vLLM V1. The framework provides no way to isolate the draft model to a subset of PP stages or to customize rank mapping in this manner. For now, the draft and target models must use compatible parallel configurations, and speculative decoding with PP is only supported for certain models that implement the required interfaces (source). Would you like more detail on the current limitations or possible workarounds?
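For reference, a setup that stays within today's constraints lets vLLM place the MTP draft itself while the target uses TP+PP. This is an illustrative sketch only: the `--tensor-parallel-size`, `--pipeline-parallel-size`, and `--speculative-config` flags exist in recent vLLM releases, but the exact JSON keys (in particular the `"method"` value for GLM's MTP head) vary by model and version, so check the docs for your release.

```shell
# Illustrative only: serve the target with TP=4, PP=2 across 2 nodes and
# enable MTP speculative decoding with vLLM's default draft placement.
# The "method" value is an assumption; consult your vLLM version's docs.
vllm serve $MODEL \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```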
Hint: Mention @RunLLM in the post for followups.