And of course, for best performance, I use FULL CUDA Graph mode when running.
But I found that the draft model part (or MTP layer part) runs with PIECEWISE CUDA Graphs, even though the main model uses FULL CUDA Graphs in the all-decoding phase.
I wonder whether there is any method or config that can switch the draft model to FULL CUDA Graph, or whether vLLM supports this feature yet.
As of vLLM 0.17.1, mainline vLLM does not support FULL CUDA Graph mode for the drafter (including DeepSeek MTP): the drafter runs in PIECEWISE CUDA Graph mode even when the target model uses FULL CUDA Graphs. No config or flag in v0.17.1 changes this. Work to enable FULL CUDA Graph mode for the drafter is in progress, with a PR under review; see this issue and this PR.
Once merged, this feature will allow the drafter (including MTP layers) to run in FULL CUDA Graph mode, which should improve speculative decoding performance. Until then, the only option is to wait for the PR to be merged and released in a future vLLM version.
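For reference, here is a minimal sketch of the kind of setup described in the question: FULL CUDA Graphs requested for the target model, with MTP speculative decoding enabled. The key names (`cudagraph_mode`, `method`, `num_speculative_tokens`), the `"mtp"` method string, and the model name are assumptions based on recent vLLM releases and may differ in your version, so check them against the docs for your install. The point is that this setting governs the target model only; nothing here moves the drafter to FULL mode.

```python
import json

# Assumption: recent vLLM releases accept a "cudagraph_mode" key inside
# compilation_config. Per the answer above, this applies to the target
# model only; the drafter still runs with PIECEWISE CUDA Graphs.
compilation_config = {"cudagraph_mode": "FULL"}

# Assumption: the method string and key names follow recent vLLM releases;
# older versions may use a different method name for DeepSeek MTP.
speculative_config = {
    "method": "mtp",
    "num_speculative_tokens": 1,
}


def build_engine_kwargs(model: str) -> dict:
    """Collect the engine arguments in one place (hypothetical helper)."""
    return {
        "model": model,
        "compilation_config": compilation_config,
        "speculative_config": speculative_config,
    }


# Placeholder model name; substitute the checkpoint you actually serve.
kwargs = build_engine_kwargs("deepseek-ai/DeepSeek-V3")
print(json.dumps(kwargs, indent=2))

# With vLLM installed, the same arguments could be passed as:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```

The script only builds and prints the arguments, so it runs without a GPU or a vLLM install; the commented lines show where they would be handed to the engine.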
Would you like more detail on the technical implementation or how to track when this feature lands?