I noticed in recent benchmarks that SGLang has shown performance comparable to DeepSeek’s inference system in public evaluations. Does the vLLM team have any performance optimization plans or roadmap alignment with DeepSeek?
Yes, the vLLM team is actively working on performance optimizations and roadmap alignment with DeepSeek, including large-scale expert parallelism and PD (Prefill-Decode) disaggregation. The Q2 2025 roadmap specifically lists “Expert Parallelism” (with DeepEP and further fused/cutlass-based MoE kernels), “DeepSeek Specific Enhancements” (MLA, MTP, FlashAttention, Blackwell Kernels), and “Prefill Decode Disaggregation” (1P1D, XPYD, KV cache offload, and integration with LMCache/Mooncake) as core priorities for cluster-scale serving and for DeepSeek models. These efforts are aimed at matching or exceeding the performance of systems like SGLang and DeepSeek’s own inference stack, especially for large deployments (e.g., 96 H100s) and advanced MoE models. See the vLLM Q2 2025 Roadmap and disaggregated prefill RFC for details.
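For concreteness, here is a minimal, hedged sketch of what expert-parallel serving of a DeepSeek-style MoE model can look like with vLLM’s offline Python API. The model name and parallelism sizes are illustrative, and `enable_expert_parallel` is my reading of the engine argument in recent vLLM releases, so verify it against your installed version rather than treating this as an official recipe.

```python
# Minimal sketch (not an official recipe): tensor parallelism plus expert
# parallelism for a MoE checkpoint via vLLM's offline LLM API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # illustrative model; swap in your checkpoint
    tensor_parallel_size=8,            # shard attention/linear layers across 8 GPUs
    enable_expert_parallel=True,       # assumed flag: distribute MoE experts across the group
                                       # instead of sharding every expert on every GPU
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Explain prefill/decode disaggregation in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```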
Would you like a breakdown of current implementation status, best practices for DeepSeek deployment, or more detail on PD disaggregation and expert parallelism in vLLM?
Yup! We are in the final stretch of the DP + EP + PD setup, with dual batch overlap and other features also in progress. Please hang tight!
vLLM is indeed in the final stages of supporting Data Parallelism (DP), Expert Parallelism (EP), and Prefill-Decode (PD) disaggregation, with ongoing work on dual batch overlap and related features. These improvements are tracked in the Q2 2025 roadmap and DP+EP RFC, and are designed to enable large-scale, high-throughput deployments similar to DeepSeek and SGLang.
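As a rough illustration of the PD-disaggregation piece, the sketch below mirrors the shape of the 1P1D setup from the disaggregated prefill examples: a producer instance runs prefill and exports its KV cache, and a consumer instance picks it up for decode. The connector and field names (`KVTransferConfig`, `kv_connector`, `kv_role`, `kv_rank`, `kv_parallel_size`) are assumptions taken from those examples and may differ between versions; the model choice is also just illustrative.

```python
# Hedged sketch of 1P1D disaggregated prefill: one prefill process and one
# decode process on separate GPUs, linked by a KV-cache transfer connector.
import os
from multiprocessing import Process

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig  # assumed import path; check your version

PROMPTS = ["What is MLA in DeepSeek models?"]
MODEL = "deepseek-ai/DeepSeek-V2-Lite"  # illustrative; any supported model works


def run_prefill():
    # Prefill instance on GPU 0: computes the prompt KV cache and publishes it.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    llm = LLM(
        model=MODEL,
        kv_transfer_config=KVTransferConfig(
            kv_connector="PyNcclConnector",  # NCCL-based KV-cache transfer (assumed name)
            kv_role="kv_producer",           # this instance produces the KV cache
            kv_rank=0,
            kv_parallel_size=2,              # total instances in the PD group
        ),
    )
    # max_tokens=1: only the prefill pass is needed here; decode happens elsewhere.
    llm.generate(PROMPTS, SamplingParams(max_tokens=1))


def run_decode():
    # Decode instance on GPU 1: consumes the transferred KV cache and continues
    # generation without re-running prefill.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    llm = LLM(
        model=MODEL,
        kv_transfer_config=KVTransferConfig(
            kv_connector="PyNcclConnector",
            kv_role="kv_consumer",
            kv_rank=1,
            kv_parallel_size=2,
        ),
    )
    outputs = llm.generate(PROMPTS, SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    prefill, decode = Process(target=run_prefill), Process(target=run_decode)
    prefill.start()
    decode.start()
    prefill.join()
    decode.join()
```

In a production XPYD deployment the prefill and decode pools would run as separate serving instances behind a router rather than as sibling processes in one script; the point here is only the producer/consumer roles and the shared KV-transfer group size.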
Would you like more technical details on the implementation status or configuration for these features?