What is the recommended way to support dynamic pruning for speculative decoding draft trees?

Question

I am looking into vLLM’s speculative decoding tree mode and noticed that the current draft tree configured by --speculative-token-tree appears to be static during inference.

Would vLLM maintainers be open to supporting dynamic pruning of the draft tree at runtime, where low-confidence branches can be skipped based on token probabilities?

I would like to understand whether this direction fits vLLM’s speculative decoding roadmap, and what the preferred implementation approach would be before starting a PR.


Current behavior I observed

The draft tree topology is configured through speculative_token_tree, for example:

"[(0,), (1,), (0,0), (0,1), (1,0), (1,1)]"

From reading the code, the tree shape seems to be resolved into fixed per-level child counts such as child_drafts_per_level in vllm/v1/spec_decode/llm_base_proposer.py.
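For illustration only, here is a rough sketch of how a tree spec like the one above could be resolved into per-level child counts (this is not vLLM's actual parsing code, and the helper name is made up; it just shows how I am reading the mapping):

import ast
from collections import Counter

def child_counts_per_level(tree_spec: str) -> list[int]:
    # Parse the node list, e.g. [(0,), (1,), (0,0), (0,1), (1,0), (1,1)].
    nodes = ast.literal_eval(tree_spec)
    depth_counts = Counter(len(node) for node in nodes)
    counts = []
    prev_level_nodes = 1  # the root is the last verified token
    for level in range(1, max(depth_counts) + 1):
        # Assumes a uniform tree where every node at a level has the same fanout.
        counts.append(depth_counts[level] // prev_level_nodes)
        prev_level_nodes = depth_counts[level]
    return counts

print(child_counts_per_level("[(0,), (1,), (0,0), (0,1), (1,0), (1,1)]"))  # -> [2, 2]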

During propose_tree, each level expands nodes with the same fixed number of children:

# Static fanout chosen at configuration time for this tree level.
num_children = self.child_drafts_per_level[level]
if num_children == 1:
    # Chain-like level: keep only the single most likely draft token per node.
    draft_token_ids = logits.argmax(dim=-1).view(batch_size, -1)
else:
    # Branching level: always expand the same top-k children for every node,
    # regardless of how peaked the draft distribution is.
    draft_token_ids = torch.topk(logits, num_children, dim=-1).indices.view(
        batch_size, -1
    )

My understanding is that this means the number of draft tokens is effectively fixed by the static tree topology, regardless of how confident or uncertain the draft model is at each node.

Is this understanding correct?


Why I am asking

For some requests, the draft model may be highly confident on a branch, so expanding many sibling tokens may waste compute. For other requests, the distribution may be flatter, and more candidates may be useful.

A static tree seems less flexible in mixed workloads, especially when different sequences in the same batch have very different levels of uncertainty.

I am wondering whether vLLM could optionally support a pruning mechanism such as:

  • prune a node if its next-token max probability is below a threshold;
  • prune a branch if its cumulative path probability is below a threshold;
  • keep the current static tree as the maximum tree shape, but dynamically skip low-value subtrees at runtime (a rough sketch of this option follows the list).
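To make the third option concrete, here is a minimal sketch of what per-level pruning inside the proposer could look like. This is purely illustrative: the function name, arguments, and threshold values are hypothetical and do not exist in vLLM today.

import torch

def propose_level_with_pruning(
    logits: torch.Tensor,            # [num_nodes, vocab_size] draft logits for live nodes
    parent_path_prob: torch.Tensor,  # [num_nodes] cumulative path probability per node
    max_children: int,               # static fanout from child_drafts_per_level[level]
    min_token_prob: float = 0.05,    # hypothetical per-node threshold
    min_path_prob: float = 0.02,     # hypothetical cumulative-path threshold
):
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = torch.topk(probs, max_children, dim=-1)
    # Cumulative path probability of each candidate child.
    path_probs = parent_path_prob.unsqueeze(-1) * top_probs
    # Keep a child only if both its token probability and its path probability
    # clear the thresholds; the static tree remains the maximum shape.
    keep = (top_probs >= min_token_prob) & (path_probs >= min_path_prob)
    # Always keep the argmax child so every live node still contributes one draft.
    keep[:, 0] = True
    return top_ids, path_probs, keep

Only the drafts where keep is True would be materialized and sent to the verifier, which is exactly where the variable-length attention metadata and KV cache questions below come from.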

Related speculative decoding work such as SpecInfer and EAGLE-2 discusses adaptive or uncertainty-aware draft tree construction/pruning, so I am interested in whether a similar idea would make sense in vLLM.


Expected benefit if supported

If implemented efficiently, dynamic pruning may help reduce unnecessary draft-token computation on high-confidence prefixes while preserving the ability to explore more candidates on uncertain branches.

Potential metrics to evaluate:

  • generated tokens/s;
  • mean number of draft tokens per step;
  • accepted tokens per step;
  • acceptance rate;
  • latency distribution under mixed-difficulty batches.
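For the benchmarks I have in mind something like the following per-step bookkeeping (the class and counter names are illustrative, not existing vLLM metrics):

from dataclasses import dataclass

@dataclass
class SpecDecodeStats:
    num_steps: int = 0
    num_draft_tokens: int = 0
    num_accepted_tokens: int = 0

    def record_step(self, drafted: int, accepted: int) -> None:
        self.num_steps += 1
        self.num_draft_tokens += drafted
        self.num_accepted_tokens += accepted

    @property
    def mean_draft_tokens_per_step(self) -> float:
        return self.num_draft_tokens / max(self.num_steps, 1)

    @property
    def mean_accepted_per_step(self) -> float:
        return self.num_accepted_tokens / max(self.num_steps, 1)

    @property
    def acceptance_rate(self) -> float:
        return self.num_accepted_tokens / max(self.num_draft_tokens, 1)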

Willingness to contribute

I am willing to work on a prototype, especially the proposer-side pruning logic and benchmarks.

Before starting, I would appreciate guidance on the preferred design for the tree attention metadata, KV cache allocation, and verifier-side variable-length handling.

Answer

Yes, your understanding is correct: currently, vLLM’s tree-based speculative decoding uses a static tree topology configured via speculative_token_tree, and the number of draft tokens per level is fixed at inference time. Dynamic pruning, where branches are skipped based on token or path probabilities, is not yet implemented, but there is clear interest and ongoing discussion about adding this feature.

Maintainers are open to supporting dynamic pruning for draft trees, and there is an API proposal for adding per-node and path-probability pruning thresholds to SpeculativeConfig (e.g., speculative_tree_prune_min_prob, speculative_tree_prune_path_prob). The preferred approach is to keep the static tree as the maximum shape and prune subtrees at runtime based on configurable thresholds. Contributors are encouraged to prototype the pruning logic and tree-metadata changes, and maintainers are seeking input on the API shape and integration points. Would you like more detail on the implementation or design considerations?
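As an illustration of what the proposed configuration could look like, here is a sketch using the existing speculative_config dictionary; the two pruning fields are only proposed and are not available in current vLLM releases, and the model names are placeholders:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder target model
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # placeholder draft model
        "num_speculative_tokens": 6,  # one per node of the 6-node example tree
        "speculative_token_tree": "[(0,), (1,), (0,0), (0,1), (1,0), (1,1)]",
        # Proposed pruning thresholds; not implemented today.
        "speculative_tree_prune_min_prob": 0.05,
        "speculative_tree_prune_path_prob": 0.02,
    },
)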
