Hi vLLM team,
I want to run GLM-5.1 with speculative decoding on 2 nodes, using TP + PP.
My idea is:
- Run the main/target model with TP + PP across the 2 nodes.
- Keep the MTP draft model only on the last PP stage.
- Let the draft model use PP=1 and the same TP size as the target model.
- Map the draft ranks using the TP-local rank, since the draft is local to the last PP stage.
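To make the rank mapping concrete, here is a standalone sketch of what I mean (plain Python, not vLLM code; it assumes a PP-major rank layout where `global_rank = pp_rank * tp_size + tp_rank`):

```python
# Hypothetical sketch of the proposed draft-rank mapping (not vLLM code).
# Assumes a PP-major layout: global_rank = pp_rank * tp_size + tp_rank.

def last_stage_draft_ranks(tp_size: int, pp_size: int) -> dict[int, int]:
    """Map each global rank on the last PP stage to its TP-local rank.

    The PP=1 draft model would reuse that TP-local rank as its own TP rank,
    so the draft only ever runs on the last pipeline stage.
    """
    first = (pp_size - 1) * tp_size  # first global rank of the last stage
    return {first + tp_rank: tp_rank for tp_rank in range(tp_size)}

# Example: TP=4, PP=2 across 2 nodes -> the draft lives on global ranks 4..7,
# which map to draft TP ranks 0..3.
print(last_stage_draft_ranks(tp_size=4, pp_size=2))
# {4: 0, 5: 1, 6: 2, 7: 3}
```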
Does this make sense for vLLM’s current V1 speculative decoding runner? Is this a reasonable way to support PP with MTP draft models?
Thanks!
Currently, vLLM does not support running the MTP draft model only on the last PP stage with PP=1 while the target model uses TP+PP across nodes. The speculative decoding framework expects the draft model's distribution to match the global parallel configuration, and there is no built-in mechanism to restrict the draft model to the last PP stage or to remap draft ranks in this way. As of now, speculative decoding with pipeline parallelism (PP) is only supported when the draft and target models use the same PP configuration, and most implementations require the draft to run with PP=1 and TP=1, with ongoing work to expand support (source, source, source, source).
Your proposed approach (running the draft model only on the last PP stage with PP=1 and mapping ranks locally) is not currently supported by vLLM V1. The framework provides no way to isolate the draft model to a subset of PP stages or to customize rank mapping in this manner. For now, the draft and target models must use compatible parallel configurations, and speculative decoding with PP is only supported for certain models that implement the required interfaces (source). Would you like more detail on the current limitations or possible workarounds?
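For reference, a setup that stays within today's constraints lets vLLM place the MTP draft itself while the target uses TP+PP. This is an illustrative sketch only: the `--tensor-parallel-size`, `--pipeline-parallel-size`, and `--speculative-config` flags exist in recent vLLM releases, but the exact JSON keys (in particular the `"method"` value for GLM's MTP head) vary by model and version, so check the docs for your release.

```shell
# Illustrative only: serve the target with TP=4, PP=2 across 2 nodes and
# enable MTP speculative decoding with vLLM's default draft placement.
# The "method" value is an assumption; consult your vLLM version's docs.
vllm serve $MODEL \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```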
Hint: Mention @RunLLM in the post for followups.