Has anyone successfully run DBO in a single-node, multi-GPU environment?

I want to try DBO to see how an MoE model performs with it, so I started with the case given in the demo. I installed DeepEP (`python setup.py install` completed without errors) and ran its tests:
[nopass] python tests/test_intranode.py
[pass] python tests/test_internode.py
[nopass] python tests/test_low_latency.py
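Before digging into the serve run, a quick sanity check (a sketch; it assumes the same virtualenv used for the DeepEP build is active) can confirm the extension actually imports, since an import failure would also explain the intranode/low-latency test failures:

```shell
# Sanity check: can Python see CUDA and the freshly built deep_ep module?
# (Assumes the virtualenv used for `python setup.py install` is active.)
python -c "import torch; print('visible GPUs:', torch.cuda.device_count())" \
  2>/dev/null || echo "torch/CUDA not available"
python -c "import deep_ep" 2>/dev/null \
  && echo "deep_ep imports OK" \
  || echo "deep_ep import failed"
```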

# fish set VLLM_ALL2ALL_BACKEND
set -x VLLM_ALL2ALL_BACKEND deepep_low_latency
vllm serve deepseek-ai/DeepSeek-V2-Lite --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --max-model-len 4096 --max-num-seqs 64 --enable-dbo
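For reference, the same setup in bash (equivalent to the fish `set -x` above; the serve invocation is left commented out since it needs two GPUs and a running model download):

```shell
# bash equivalent of the fish `set -x VLLM_ALL2ALL_BACKEND ...`
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
echo "$VLLM_ALL2ALL_BACKEND"

# Then launch the single-node, 2-GPU DBO run:
# vllm serve deepseek-ai/DeepSeek-V2-Lite --trust-remote-code \
#   --data-parallel-size 2 --enable-expert-parallel \
#   --max-model-len 4096 --max-num-seqs 64 --enable-dbo
```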

This is the log:

(EngineCore_DP0 pid=808262) INFO 12-01 15:01:48 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='deepseek-ai/DeepSeek-V2-Lite', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-V2-Lite', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=deepseek-ai/DeepSeek-V2-Lite, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': 
[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 128, 'local_cache_dir': None}
(EngineCore_DP0 pid=808262) INFO 12-01 15:01:51 [parallel_state.py:1208] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:57385 backend=nccl
(EngineCore_DP1 pid=808263) INFO 12-01 15:01:51 [parallel_state.py:1208] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:57385 backend=nccl
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_DP0 pid=808262) INFO 12-01 15:01:52 [pynccl.py:111] vLLM is using nccl==2.27.7
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_DP1 pid=808263) [2025-12-01 15:01:53] INFO _optional_torch_c_dlpack.py:88: JIT-compiling torch-c-dlpack-ext to cache...
(EngineCore_DP0 pid=808262) [2025-12-01 15:01:53] INFO _optional_torch_c_dlpack.py:88: JIT-compiling torch-c-dlpack-ext to cache...

I also tried the DBO test from the CI script:

pytest -v -s tests/v1/distributed/test_dbo.py
ImportError while loading conftest '/data/liushuai/envs/vllm/tests/conftest.py'.
tests/conftest.py:49: in <module>
    from vllm import LLM, SamplingParams, envs
vllm/__init__.py:74: in __getattr__
    module = import_module(module_name, __package__)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm/entrypoints/llm.py:83: in <module>
    from vllm.v1.engine.llm_engine import LLMEngine
vllm/v1/engine/llm_engine.py:30: in <module>
    from vllm.v1.engine.core_client import EngineCoreClient
vllm/v1/engine/core_client.py:42: in <module>
    from vllm.v1.engine.core import EngineCore, EngineCoreProc
vllm/v1/engine/core.py:54: in <module>
    from vllm.v1.engine.utils import (
vllm/v1/engine/utils.py:26: in <module>
    from vllm.v1.executor import Executor
vllm/v1/executor/__init__.py:3: in <module>
    from .abstract import Executor
vllm/v1/executor/abstract.py:11: in <module>
    from vllm.distributed.kv_transfer.kv_connector.utils import KVOutputAggregator
vllm/distributed/kv_transfer/kv_connector/utils.py:12: in <module>
    from vllm import _custom_ops as ops
vllm/_custom_ops.py:1176: in <module>
    @register_fake("_C::gptq_marlin_repack")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/home/liushuai/vllm-deepep-dev/lib/python3.12/site-packages/torch/library.py:1063: in register
    use_lib._register_fake(
/home/liushuai/vllm-deepep-dev/lib/python3.12/site-packages/torch/library.py:211: in _register_fake
    handle = entry.fake_impl.register(
/home/liushuai/vllm-deepep-dev/lib/python3.12/site-packages/torch/_library/fake_impl.py:51: in register
    raise RuntimeError(
E   RuntimeError: register_fake(...): the operator _C::gptq_marlin_repack already has an DispatchKey::Meta implementation via a pre-existing torch.library or TORCH_LIBRARY registration. Please either remove that registration or don't call register_fake.
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

I want to know how to run a DBO MoE model on a single machine. Environment:

GPU: H20 x 8
deep-ep: 1.2.1+bfded34
vllm: 0.11.2


To run a DBO (Dual Batch Overlap) MoE model on a single machine, you need to: (1) install DeepEP and its dependencies; (2) use a command like `vllm serve deepseek-ai/DeepSeek-V2-Lite --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --enable-dbo --all2all-backend deepep_low_latency`; and (3) have at least two GPUs available. DBO requires `--enable-expert-parallel`, `--data-parallel-size N` (N > 1), `--enable-dbo`, and an all2all backend of either `deepep_low_latency` or `deepep_high_throughput`, with DeepEP installed. The log output you provided shows correct initialization for a single-node, 2-GPU DBO MoE run. The pytest failure, however, is unrelated to DBO serving: it is a duplicate operator registration (`_C::gptq_marlin_repack`) in your Python environment, not a vLLM or DBO configuration issue. This is most likely caused by multiple or conflicting installations of vLLM or its dependencies (see the error details).

To fix the ImportError, clean your Python environment so that only one vLLM installation exists, and avoid mixing different vLLM or torch versions. Once the environment issue is resolved, you can run a DBO MoE model on a single machine as described above. For more details and troubleshooting, see the Dual Batch Overlap documentation and Expert Parallel Deployment.
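A sketch of how to check for the duplicate-install situation described above (the package-name patterns are assumptions; adjust to your environment):

```shell
# Look for multiple or conflicting installs that can double-register
# custom ops such as _C::gptq_marlin_repack.
pip list 2>/dev/null | grep -i -E 'vllm|torch|deep' || true

# Confirm which copy of vllm the interpreter would actually import:
python - <<'PY'
import importlib.util
spec = importlib.util.find_spec("vllm")
print(spec.origin if spec else "vllm not found on this interpreter")
PY
```

If `pip list` shows vLLM more than once, or the printed path points somewhere other than your intended virtualenv, uninstall all copies and reinstall a single one.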

Would you like a step-by-step breakdown of the setup and troubleshooting process?

