I noticed in recent benchmarks that SGLang has shown performance comparable to DeepSeek’s inference system in public evaluations. Does the vLLM team have any performance optimization plans or roadmap alignment with DeepSeek?
Yes, the vLLM team is actively working on performance optimizations and roadmap alignment with DeepSeek, including large-scale expert parallelism and PD (Prefill-Decode) disaggregation. The Q2 2025 roadmap specifically lists “Expert Parallelism” (with DeepEP and further fused/cutlass-based MoE kernels), “DeepSeek Specific Enhancements” (MLA, MTP, FlashAttention, Blackwell Kernels), and “Prefill Decode Disaggregation” (1P1D, XPYD, KV cache offload, and integration with LMCache/Mooncake) as core priorities for cluster-scale serving and for DeepSeek models. These efforts are aimed at matching or exceeding the performance of systems like SGLang and DeepSeek’s own inference stack, especially for large deployments (e.g., 96 H100s) and advanced MoE models. See the vLLM Q2 2025 Roadmap and disaggregated prefill RFC for details.
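For concreteness, here is a minimal, hedged sketch of what expert-parallel serving of a DeepSeek-style MoE model can look like with vLLM’s offline Python API. The model name and parallelism sizes are illustrative, and `enable_expert_parallel` is my reading of the engine argument in recent vLLM releases, so verify it against your installed version rather than treating this as an official recipe.

```python
# Minimal sketch (not an official recipe): tensor parallelism plus expert
# parallelism for a MoE checkpoint via vLLM's offline LLM API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # illustrative model; swap in your checkpoint
    tensor_parallel_size=8,            # shard attention/linear layers across 8 GPUs
    enable_expert_parallel=True,       # assumed flag: distribute MoE experts across the group
                                       # instead of sharding every expert on every GPU
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Explain prefill/decode disaggregation in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```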
Would you like a breakdown of current implementation status, best practices for DeepSeek deployment, or more detail on PD disaggregation and expert parallelism in vLLM?
Yup! We are in the final stretch of the DP + EP + PD setup, with dual batch overlap and other features also in progress. Please hang tight!
vLLM is indeed in the final stages of supporting Data Parallelism (DP), Expert Parallelism (EP), and Prefill-Decode (PD) disaggregation, with ongoing work on dual batch overlap and related features. These improvements are tracked in the Q2 2025 roadmap and DP+EP RFC, and are designed to enable large-scale, high-throughput deployments similar to DeepSeek and SGLang.
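As a rough illustration of the PD-disaggregation piece, the sketch below mirrors the shape of the 1P1D setup from the disaggregated prefill examples: a producer instance runs prefill and exports its KV cache, and a consumer instance picks it up for decode. The connector and field names (`KVTransferConfig`, `kv_connector`, `kv_role`, `kv_rank`, `kv_parallel_size`) are assumptions taken from those examples and may differ between versions; the model choice is also just illustrative.

```python
# Hedged sketch of 1P1D disaggregated prefill: one prefill process and one
# decode process on separate GPUs, linked by a KV-cache transfer connector.
import os
from multiprocessing import Process

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig  # assumed import path; check your version

PROMPTS = ["What is MLA in DeepSeek models?"]
MODEL = "deepseek-ai/DeepSeek-V2-Lite"  # illustrative; any supported model works


def run_prefill():
    # Prefill instance on GPU 0: computes the prompt KV cache and publishes it.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    llm = LLM(
        model=MODEL,
        kv_transfer_config=KVTransferConfig(
            kv_connector="PyNcclConnector",  # NCCL-based KV-cache transfer (assumed name)
            kv_role="kv_producer",           # this instance produces the KV cache
            kv_rank=0,
            kv_parallel_size=2,              # total instances in the PD group
        ),
    )
    # max_tokens=1: only the prefill pass is needed here; decode happens elsewhere.
    llm.generate(PROMPTS, SamplingParams(max_tokens=1))


def run_decode():
    # Decode instance on GPU 1: consumes the transferred KV cache and continues
    # generation without re-running prefill.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    llm = LLM(
        model=MODEL,
        kv_transfer_config=KVTransferConfig(
            kv_connector="PyNcclConnector",
            kv_role="kv_consumer",
            kv_rank=1,
            kv_parallel_size=2,
        ),
    )
    outputs = llm.generate(PROMPTS, SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    prefill, decode = Process(target=run_prefill), Process(target=run_decode)
    prefill.start()
    decode.start()
    prefill.join()
    decode.join()
```

In a production XPYD deployment the prefill and decode pools would run as separate serving instances behind a router rather than as sibling processes in one script; the point here is only the producer/consumer roles and the shared KV-transfer group size.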
Would you like more technical details on the implementation status or configuration for these features?