Question
I am looking into vLLM’s speculative decoding tree mode and noticed that the draft tree configured via --speculative-token-tree appears to be static during inference.
Would vLLM maintainers be open to supporting dynamic pruning of the draft tree at runtime, where low-confidence branches can be skipped based on token probabilities?
I would like to understand whether this direction fits vLLM’s speculative decoding roadmap, and what the preferred implementation approach would be before starting a PR.
Current behavior I observed
The draft tree topology is configured through speculative_token_tree, for example:
"[(0,), (1,), (0,0), (0,1), (1,0), (1,1)]"
From reading the code, the tree shape seems to be resolved into fixed per-level child counts such as child_drafts_per_level in vllm/v1/spec_decode/llm_base_proposer.py.
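To illustrate my reading of that resolution step, here is a toy sketch of how a node list could be reduced to per-level child counts. This is not vLLM's actual code; the function name and the assumption that each tuple encodes the path of child indices from the root are mine:

```python
from collections import defaultdict

def child_counts_per_level(tree):
    # Each node is a tuple of child indices from the root, e.g. (0, 1)
    # means "second child of the first child of the root". The child
    # count at a level is the largest child index seen there, plus one.
    counts = defaultdict(int)
    for node in tree:
        level = len(node) - 1  # the root's children sit at level 0
        counts[level] = max(counts[level], node[-1] + 1)
    return [counts[level] for level in sorted(counts)]

tree = [(0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]
print(child_counts_per_level(tree))  # [2, 2]: two children per node at each level
```

Under this reading, the example tree above collapses to a uniform "2 children per node, 2 levels deep" shape, which matches the fixed per-level expansion in the snippet below.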
During propose_tree, each level expands nodes with the same fixed number of children:
num_children = self.child_drafts_per_level[level]
if num_children == 1:
    draft_token_ids = logits.argmax(dim=-1).view(batch_size, -1)
else:
    draft_token_ids = torch.topk(logits, num_children, dim=-1).indices.view(
        batch_size, -1
    )
My understanding is that this means the number of draft tokens is effectively fixed by the static tree topology, regardless of how confident or uncertain the draft model is at each node.
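To make the concern concrete, here is a toy illustration (plain Python, not vLLM code) of why static top-k expansion is insensitive to the draft model's confidence:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_indices(probs, k):
    # Static expansion: always keep exactly k candidates.
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:k]

peaked = softmax([10.0, 1.0, 0.9, 0.8])  # draft model is nearly certain
flat = softmax([1.0, 0.9, 0.8, 0.7])     # draft model is uncertain

# Both distributions get the same number of children under a static tree:
print(len(topk_indices(peaked, 2)), len(topk_indices(flat, 2)))  # 2 2
# ...even though the runner-up in `peaked` carries almost no probability mass:
print(round(peaked[1], 4))
```

In the peaked case, the second sibling is very unlikely to be accepted by the verifier, yet it still costs a full draft-token expansion (and KV cache slots) at every level below it.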
Is this understanding correct?
Why I am asking
For some requests, the draft model is highly confident on a branch, so expanding many sibling tokens wastes draft compute; for others, the distribution is flatter and extra candidates are more likely to be accepted.
A static tree handles such mixed workloads poorly, especially when sequences in the same batch have very different levels of uncertainty.
I am wondering whether vLLM could optionally support a pruning mechanism such as:
- prune a node if its next-token max probability is below a threshold;
- prune a branch if its cumulative path probability is below a threshold;
- keep the current static tree as the maximum tree shape, but dynamically skip low-value subtrees at runtime.
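As a concrete sketch of how the three rules could combine, here is a standalone per-node pruning decision. All names and threshold values are illustrative assumptions, not a proposed vLLM API:

```python
def keep_children(probs, parent_path_prob, max_children,
                  min_token_prob=0.05, min_path_prob=0.01):
    """Decide which children of one tree node to expand.

    Rule 3: the static tree's child budget (max_children) is the upper
    bound. Rules 1 and 2: drop any child whose next-token probability
    or cumulative path probability falls below a threshold (assumed
    values; they would need tuning).
    """
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:max_children]
    kept = []
    for i in ranked:
        path_prob = parent_path_prob * probs[i]
        if probs[i] < min_token_prob or path_prob < min_path_prob:
            continue  # prune: no draft tokens expanded below this branch
        kept.append((i, path_prob))
    return kept

# Confident node: only the top child survives the thresholds.
print(keep_children([0.9, 0.04, 0.03, 0.03], parent_path_prob=1.0, max_children=2))
# Flat node: both budgeted children are worth keeping.
print(keep_children([0.4, 0.35, 0.15, 0.1], parent_path_prob=1.0, max_children=2))
```

The returned path probabilities would propagate down the tree, so deep branches get pruned once their cumulative probability decays below min_path_prob even if each individual step looks acceptable. The harder part, which this sketch ignores, is that pruning makes the tree shape variable per sequence, which ripples into the tree attention metadata and verifier-side handling mentioned below.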
Related speculative decoding work such as SpecInfer and EAGLE-2 discusses adaptive or uncertainty-aware draft tree construction/pruning, so I am interested in whether a similar idea would make sense in vLLM.
Expected benefit if supported
If implemented efficiently, dynamic pruning may help reduce unnecessary draft-token computation on high-confidence prefixes while preserving the ability to explore more candidates on uncertain branches.
Potential metrics to evaluate:
- generated tokens/s;
- mean number of draft tokens per step;
- accepted tokens per step;
- acceptance rate;
- latency distribution under mixed-difficulty batches.
Willingness to contribute
I am willing to work on a prototype, especially the proposer-side pruning logic and benchmarks.
Before starting, I would appreciate guidance on the preferred design for the tree attention metadata, KV cache allocation, and verifier-side variable-length handling.