Distributed Speculative Decoding using Ray

Good morning,

I have a DGX Spark and an RTX 5090 node, which I run as a Ray cluster using pipeline parallelism with asymmetric layer assignment. I looked at the vLLM Speculative Decoding docs and would like to run the speculative (draft) model on the RTX 5090 node (since it is small) and the verifier model on the DGX Spark (since it is larger), using the Ray backend. Is that possible, or must the speculative decoding models run on a single GPU or a single node?

vLLM’s speculative decoding currently requires the draft (speculative) model and the target (verifier) model to be launched together in the same vLLM process. The draft model typically runs without tensor parallelism, while the target model can use tensor parallelism. The documentation provides no way to assign the draft and verifier models to different nodes; both are managed internally by vLLM and are not independently scheduled across nodes in a Ray cluster. Running the draft model on one node (e.g., the RTX 5090) and the verifier on another (e.g., the DGX Spark) is therefore not supported as a standard feature: both models are colocated within the same process group for efficiency and simplicity, even when using Ray as the backend. See the vLLM Speculative Decoding docs and the parallelism and scaling docs.
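As a point of reference, the single-engine design is visible in how speculative decoding is configured: everything goes through one `vllm serve` invocation, with the draft model given as a field of the engine's `--speculative-config`. A minimal sketch (the model names and `num_speculative_tokens` value here are illustrative, and `--speculative-config` is the JSON-valued flag in recent vLLM versions):

```shell
# One engine, one command: the draft model is a field of the target
# engine's configuration, not a separately launched server, so there
# is no per-model placement option here. Model names are illustrative.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-config '{"model": "meta-llama/Llama-3.1-8B-Instruct", "num_speculative_tokens": 5}'
```

Because the draft model is just a config entry, it inherits the engine's scheduling: there is no separate endpoint or process to pin to a node.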

Would you like more detail on how speculative decoding is scheduled or possible workarounds?

Hint: Mention @RunLLM in the post for followups.

@RunLLM is there a workaround to assign the speculative and verifier models to specific nodes, for instance using an undocumented ENV flag?

It seems likely that there is no documented or supported way to assign the speculative (draft) model and the verifier (target) model to different nodes, even with environment variables or undocumented flags. The speculative decoding logic in vLLM is designed to manage both models within the same process, and there is no mention in the docs or issues of an ENV flag or hack to split their execution across nodes. All parallelism and placement options (e.g., tensor/pipeline parallelism, Ray backend) apply to the entire vLLM engine, not to individual models within speculative decoding. See Speculative Decoding docs and related GitHub issues.
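To illustrate why an ENV-flag workaround is unlikely: all placement options act on the engine as a whole, so even a Ray-backed launch with pipeline parallelism (sketched below; the model names are illustrative) pulls the draft model into the same engine rather than pinning it to a node:

```shell
# --distributed-executor-backend ray distributes the *engine's* workers
# across the Ray cluster; the draft model rides along inside that engine.
# No documented environment variable or flag overrides placement per
# model. Model names are illustrative.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2 \
  --speculative-config '{"model": "meta-llama/Llama-3.1-8B-Instruct", "num_speculative_tokens": 5}'
```

Any node-level control you do have (Ray placement groups, resource labels) applies to the engine's worker set as a unit, not to the draft model separately.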

Would you like a step-by-step explanation of how vLLM manages model placement and why this limitation exists?
