I want to try DBO to see how an MoE model performs with it, starting with the case given in the demo. I installed DeepEP (python setup.py install completed without errors) and ran its self-tests:
[nopass] python tests/test_intranode.py
[pass] python tests/test_internode.py
[nopass] python tests/test_low_latency.py
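The three checks above can be scripted as a single pass/fail sweep (a minimal sketch; run from the DeepEP repo root, where the tests/ directory lives):

```shell
# Run each DeepEP self-test and print its status; in the run reported
# above, only test_internode.py passed.
for t in tests/test_intranode.py tests/test_internode.py tests/test_low_latency.py; do
    if python "$t"; then
        echo "[pass]   $t"
    else
        echo "[nopass] $t"
    fi
done
```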
# fish set VLLM_ALL2ALL_BACKEND
set -x VLLM_ALL2ALL_BACKEND deepep_low_latency
vllm serve deepseek-ai/DeepSeek-V2-Lite --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --max-model-len 4096 --max-num-seqs 64 --enable-dbo
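For anyone not on fish, the same setup in bash looks like this (a sketch; flags taken from the command above):

```shell
# bash equivalent of the fish environment setup and launch above
export VLLM_ALL2ALL_BACKEND=deepep_low_latency

vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --trust-remote-code \
    --data-parallel-size 2 \
    --enable-expert-parallel \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enable-dbo
```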
Here is the log:
(EngineCore_DP0 pid=808262) INFO 12-01 15:01:48 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='deepseek-ai/DeepSeek-V2-Lite', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-V2-Lite', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=deepseek-ai/DeepSeek-V2-Lite, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': 
[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 128, 'local_cache_dir': None}
(EngineCore_DP0 pid=808262) INFO 12-01 15:01:51 [parallel_state.py:1208] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:57385 backend=nccl
(EngineCore_DP1 pid=808263) INFO 12-01 15:01:51 [parallel_state.py:1208] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:57385 backend=nccl
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_DP0 pid=808262) INFO 12-01 15:01:52 [pynccl.py:111] vLLM is using nccl==2.27.7
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_DP1 pid=808263) [2025-12-01 15:01:53] INFO _optional_torch_c_dlpack.py:88: JIT-compiling torch-c-dlpack-ext to cache...
(EngineCore_DP0 pid=808262) [2025-12-01 15:01:53] INFO _optional_torch_c_dlpack.py:88: JIT-compiling torch-c-dlpack-ext to cache...
I also tried the DBO test used in the CI script:
pytest -v -s tests/v1/distributed/test_dbo.py
ImportError while loading conftest '/data/liushuai/envs/vllm/tests/conftest.py'.
tests/conftest.py:49: in <module>
from vllm import LLM, SamplingParams, envs
vllm/__init__.py:74: in __getattr__
module = import_module(module_name, __package__)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm/entrypoints/llm.py:83: in <module>
from vllm.v1.engine.llm_engine import LLMEngine
vllm/v1/engine/llm_engine.py:30: in <module>
from vllm.v1.engine.core_client import EngineCoreClient
vllm/v1/engine/core_client.py:42: in <module>
from vllm.v1.engine.core import EngineCore, EngineCoreProc
vllm/v1/engine/core.py:54: in <module>
from vllm.v1.engine.utils import (
vllm/v1/engine/utils.py:26: in <module>
from vllm.v1.executor import Executor
vllm/v1/executor/__init__.py:3: in <module>
from .abstract import Executor
vllm/v1/executor/abstract.py:11: in <module>
from vllm.distributed.kv_transfer.kv_connector.utils import KVOutputAggregator
vllm/distributed/kv_transfer/kv_connector/utils.py:12: in <module>
from vllm import _custom_ops as ops
vllm/_custom_ops.py:1176: in <module>
@register_fake("_C::gptq_marlin_repack")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/home/liushuai/vllm-deepep-dev/lib/python3.12/site-packages/torch/library.py:1063: in register
use_lib._register_fake(
/home/liushuai/vllm-deepep-dev/lib/python3.12/site-packages/torch/library.py:211: in _register_fake
handle = entry.fake_impl.register(
/home/liushuai/vllm-deepep-dev/lib/python3.12/site-packages/torch/_library/fake_impl.py:51: in register
raise RuntimeError(
E RuntimeError: register_fake(...): the operator _C::gptq_marlin_repack already has an DispatchKey::Meta implementation via a pre-existing torch.library or TORCH_LIBRARY registration. Please either remove that registration or don't call register_fake.
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
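The register_fake RuntimeError above typically appears when the _C ops get registered twice, e.g. when pytest imports the vllm/ source tree while a separately installed vLLM wheel has already registered the ops (an assumption on my part, not confirmed from the log). A quick stdlib-only check of which vllm copy the interpreter actually resolves:

```python
# Print where "vllm" would be imported from; if this points into the source
# checkout while pip's installed copy lives in site-packages, both copies
# can try to register _C::gptq_marlin_repack, triggering the error above.
import importlib.util

spec = importlib.util.find_spec("vllm")
print("vllm resolves to:", spec.origin if spec else "not importable")
```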
I want to know how to run an MoE model with DBO on a single machine. Environment:
8x H20 GPUs
deep-ep 1.2.1+bfded34
vllm 0.11.2