DeepSeek MTP full CUDA graph support?

Hey guys!
I am on vLLM 0.17.1, running DeepSeek R1 with the following speculative config:

{
  "method": "deepseek_mtp",
  "num_speculative_tokens": 3,
  "draft_tensor_parallel_size": 16
}

And of course, for best performance, I run with FULL CUDA graph mode.
But I found that the draft model part (the MTP layer) runs in PIECEWISE CUDA graph mode, even though the main model runs in FULL CUDA graph mode during the all-decode phase.

Is there any method or config that switches the draft model to FULL CUDA graph mode, or does vLLM not support this feature yet?

As of vLLM 0.17.1, full CUDA Graph support for the drafter (including DeepSeek MTP) is not available in mainline vLLM; the drafter runs in PIECEWISE CUDA Graph mode even if the target model uses FULL CUDA Graphs. However, there is active development to enable FULL CUDA Graph mode for the drafter, with a PR implementing this feature under review and tracked in this issue and this PR. No config or flag in v0.17.1 enables FULL CUDA Graph for the drafter yet.

Once merged, this feature will allow the drafter (including MTP layers) to leverage FULL CUDA Graph mode, improving speculative decoding performance. For now, you must wait for the relevant PR to be merged and released in a future vLLM version. See discussion and technical details in the tracking issue and the PR.
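For reference, the setup described in the question can be sketched as a launch command. This is only an illustration under assumptions: the `--speculative-config` and `--compilation-config` flags and the `cudagraph_mode` field should be checked against the documentation of your vLLM version, and the model ID passed in is a placeholder. It only affects the target model; it does not move the drafter to FULL CUDA graph mode.

```python
import json
import shlex

# Speculative decoding settings copied from the question.
speculative_config = {
    "method": "deepseek_mtp",
    "num_speculative_tokens": 3,
    "draft_tensor_parallel_size": 16,
}

# Assumption: the target model's CUDA graph mode is selected through the
# compilation config's `cudagraph_mode` field; verify this for your version.
compilation_config = {"cudagraph_mode": "FULL"}


def build_serve_command(model: str) -> str:
    """Return a shell-safe `vllm serve` command line for the given model."""
    args = [
        "vllm", "serve", model,
        "--speculative-config", json.dumps(speculative_config),
        "--compilation-config", json.dumps(compilation_config),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # The model ID is a placeholder; substitute the checkpoint you serve.
    print(build_serve_command("deepseek-ai/DeepSeek-R1"))
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the nested quotes on the command line.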

Would you like more detail on the technical implementation or how to track when this feature lands?

Hint: Mention @RunLLM in the post for followups.

@RunLLM How do the scheduler and gpu_model_runner communicate?

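The scheduler and gpu_model_runner (the GPU model executor) exchange data through inter-process communication (IPC), relying mainly on multiprocessing queues, shared-memory (SHM) channels, communication backends such as NCCL, and Python Future objects. The scheduler sends its scheduling result (for example, a SchedulerOutput) to gpu_model_runner through a queue or shared memory; gpu_model_runner runs model inference and then returns the result (for example, a ModelRunnerOutput) to the scheduler through a queue. For the concrete implementation, see the architecture design docs and the related source code notes.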

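This communication scheme keeps scheduling and model execution decoupled and efficiently concurrent, and it supports multi-GPU and distributed setups. The detailed flow and interface definitions are in vllm/v1/engine/core.py and vllm/v1/worker/gpu_model_runner.py.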
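The request/response pattern described above can be sketched with plain Python multiprocessing queues. This is a toy model, not vLLM code: the `SchedulerOutput` and `ModelRunnerOutput` classes below are simplified, hypothetical stand-ins that only borrow the names, and `execute_model` fakes the forward pass.

```python
import multiprocessing as mp
from dataclasses import dataclass


@dataclass
class SchedulerOutput:
    """Toy stand-in for a scheduling decision (hypothetical, not vLLM's class)."""
    step: int
    request_ids: list[str]


@dataclass
class ModelRunnerOutput:
    """Toy stand-in for per-step results (hypothetical, not vLLM's class)."""
    step: int
    sampled_token_ids: dict[str, int]


def execute_model(sched: SchedulerOutput) -> ModelRunnerOutput:
    # Placeholder for a forward pass: emit a fake token ID per request.
    tokens = {rid: sched.step + i for i, rid in enumerate(sched.request_ids)}
    return ModelRunnerOutput(step=sched.step, sampled_token_ids=tokens)


def worker_loop(input_queue, output_queue) -> None:
    """Consume scheduler outputs until a None sentinel arrives."""
    while True:
        sched = input_queue.get()
        if sched is None:
            break
        output_queue.put(execute_model(sched))


if __name__ == "__main__":
    to_worker, from_worker = mp.Queue(), mp.Queue()
    worker = mp.Process(target=worker_loop, args=(to_worker, from_worker))
    worker.start()
    for step in range(2):
        # Scheduler side: send one step's decision, then wait for its result.
        to_worker.put(SchedulerOutput(step=step, request_ids=["req-a", "req-b"]))
        print(from_worker.get())
    to_worker.put(None)  # Tell the worker to shut down.
    worker.join()
```

Keeping the worker loop independent of the queue type means the same function runs against `multiprocessing.Queue` across processes or `queue.Queue` within one process, which makes the decoupling easy to test.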

Would you like a more detailed walkthrough of the communication flow or the source code?