Distributed Speculative Decoding using Ray

Good morning,

I have a DGX Spark and an RTX 5090 node, which I run as a Ray cluster using pipeline parallelism with asymmetric layer assignment. I looked at the vLLM Speculative Decoding docs and would like to run the speculative (draft) model on the RTX 5090 node (since it is small) and the verifier model on the DGX Spark (since it is larger), using the Ray backend. Is that possible, or must the speculative decoding models run on a single GPU or a single node?

vLLM’s speculative decoding currently requires the draft (speculative) model and the target (verifier) model to be launched together in the same vLLM process. The draft model typically runs without tensor parallelism, while the target model can use tensor parallelism. The documentation provides no way to assign the draft and verifier models to different nodes; both are managed internally by vLLM and are not independently scheduled across nodes in a Ray cluster. Running the draft model on one node (e.g., the RTX 5090) and the verifier on another (e.g., the DGX Spark) is therefore not supported as a standard feature: both models are colocated within the same process group for efficiency and simplicity, even when using Ray as the backend. See the vLLM Speculative Decoding docs and the parallelism and scaling docs.
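As a point of reference, the single-engine design is visible in how speculative decoding is configured: everything goes through one `vllm serve` invocation, with the draft model given as a field of the engine's `--speculative-config`. A minimal sketch (the model names and `num_speculative_tokens` value here are illustrative, and `--speculative-config` is the JSON-valued flag in recent vLLM versions):

```shell
# One engine, one command: the draft model is a field of the target
# engine's configuration, not a separately launched server, so there
# is no per-model placement option here. Model names are illustrative.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-config '{"model": "meta-llama/Llama-3.1-8B-Instruct", "num_speculative_tokens": 5}'
```

Because the draft model is just a config entry, it inherits the engine's scheduling: there is no separate endpoint or process to pin to a node.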

Would you like more detail on how speculative decoding is scheduled or possible workarounds?

Hint: Mention @RunLLM in the post for followups.

@RunLLM is there a workaround to assign the speculative and verifier models to specific nodes, for instance using an undocumented ENV flag?

It seems likely that there is no documented or supported way to assign the speculative (draft) model and the verifier (target) model to different nodes, even with environment variables or undocumented flags. The speculative decoding logic in vLLM is designed to manage both models within the same process, and there is no mention in the docs or issues of an ENV flag or hack to split their execution across nodes. All parallelism and placement options (e.g., tensor/pipeline parallelism, Ray backend) apply to the entire vLLM engine, not to individual models within speculative decoding. See Speculative Decoding docs and related GitHub issues.
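To illustrate why an ENV-flag workaround is unlikely: all placement options act on the engine as a whole, so even a Ray-backed launch with pipeline parallelism (sketched below; the model names are illustrative) pulls the draft model into the same engine rather than pinning it to a node:

```shell
# --distributed-executor-backend ray distributes the *engine's* workers
# across the Ray cluster; the draft model rides along inside that engine.
# No documented environment variable or flag overrides placement per
# model. Model names are illustrative.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2 \
  --speculative-config '{"model": "meta-llama/Llama-3.1-8B-Instruct", "num_speculative_tokens": 5}'
```

Any node-level control you do have (Ray placement groups, resource labels) applies to the engine's worker set as a unit, not to the draft model separately.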

Would you like a step-by-step explanation of how vLLM manages model placement and why this limitation exists?
