vLLM 是否支持多节点的PD分离部署?如果支持,请提供部署参考命令和kvcache部署。
vLLM 支持多节点的 Prefill-Decode(PD)分离部署,尤其适用于 DeepSeek 等大模型。你需要分别为 Prefill 节点和 Decode 节点启动 vllm serve,并通过 --kv-transfer-config 配置 KV cache 传递。具体命令和参数需参考模型和集群实际情况,常见参数包括 --data-parallel-size、–data-parallel-rank、–data-parallel-address、–data-parallel-rpc-port 以及 --kv-transfer-config。KV cache 的传递依赖于一致的配置和网络连通性,且 Prefill/Decode 节点需使用相同的编译配置和 KV 传输设置,否则可能导致解码端等待 KV 块超时或格式不符等问题。详细部署流程和参数说明可参考 Kimi-K2.5 PD分离部署文档 及 官方多节点部署说明。
KV cache 传递需确保 --kv-transfer-config 参数一致,并建议升级至 vLLM 0.16.x 及以上版本以获得更好的稳定性。若遇到 KV 传输超时,可通过设置环境变量 VLLM_RPC_TIMEOUT 增大超时时间。更多实战经验和问题排查可参考 相关 issue 讨论。需要详细命令和参数示例吗?
Sources:
Hint: Mention @RunLLM in the post for followups.
@RunLLM 发起请求,prefill和decode服务出现崩溃,请问是什么原因
A节点有2张L20 (2x48GB),作为prefill。B节点有2张A100 (2x80GB),作为decode。
1、发起请求
curl -X POST http://localhost:10001/v1/chat/completions -H “Content-Type: application/json” -d ‘{
“model”: “/data-new/models/Qwen3-30B-A3B”,
“messages”: [{“role”: “user”, “content”: “hello”}],
“max_tokens”: 10
}’
2、proxy服务没有请求的日志
[2026-05-14 01:34:11 +0000] [26] [INFO] Running on http://0.0.0.0:10001 (CTRL + C to quit)
- Serving Quart app ‘disagg_proxy_p2p_nccl_xpyd’
- Debug mode: False
- Please use an ASGI server (e.g. Hypercorn) directly in production
- Running on http://0.0.0.0:10001 (CTRL + C to quit)
Add [HTTP:10.10.185.25:8000, ZMQ:10.10.185.25:21001]
Add [HTTP:10.10.185.28:8001, ZMQ:10.10.185.28:22001]
在prefill和decode服务崩溃后,有日志
handle_request count: 0, [HTTP:10.10.185.25:8000, ZMQ:10.10.185.25:21001]
[HTTP:10.10.185.28:8001, ZMQ:10.10.185.28:22001]
[2026-05-14 01:46:56 +0000] [26] [INFO] 127.0.0.1:59024 POST /v1/chat/completions 1.1 200 - 600204846
3、prefill有日志
(Worker_TP1 pid=529) INFO 05-14 01:36:56 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {‘NCCL_MAX_NCHANNELS’: None, ‘NCCL_MIN_NCHANNELS’: None, ‘NCCL_CUMEM_ENABLE’: None, ‘NCCL_BUFFSIZE’: ‘4194304’, ‘NCCL_PROTO’: None, ‘NCCL_ALGO’: None}
(Worker_TP0 pid=528) INFO 05-14 01:36:56 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {‘NCCL_MAX_NCHANNELS’: None, ‘NCCL_MIN_NCHANNELS’: None, ‘NCCL_CUMEM_ENABLE’: None, ‘NCCL_BUFFSIZE’: ‘4194304’, ‘NCCL_PROTO’: None, ‘NCCL_ALGO’: None}
(Worker_TP1 pid=529) INFO 05-14 01:36:56 [p2p_nccl_engine.py:226]
ncclCommInitRank Success, 10.10.185.25:21002
10.10.185.28:22002, MyRank:0
(Worker_TP0 pid=528) INFO 05-14 01:36:56 [p2p_nccl_engine.py:226]
ncclCommInitRank Success, 10.10.185.25:21001
10.10.185.28:22001, MyRank:0
(EngineCore pid=328) INFO 05-14 01:37:56 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=328) INFO 05-14 01:38:56 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
4、decode有日志
(Worker_TP0 pid=442) INFO 05-14 01:36:56 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {‘NCCL_MAX_NCHANNELS’: None, ‘NCCL_MIN_NCHANNELS’: None, ‘NCCL_CUMEM_ENABLE’: None, ‘NCCL_BUFFSIZE’: ‘4194304’, ‘NCCL_PROTO’: None, ‘NCCL_ALGO’: None}
(Worker_TP1 pid=443) INFO 05-14 01:36:56 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {‘NCCL_MAX_NCHANNELS’: None, ‘NCCL_MIN_NCHANNELS’: None, ‘NCCL_CUMEM_ENABLE’: None, ‘NCCL_BUFFSIZE’: ‘4194304’, ‘NCCL_PROTO’: None, ‘NCCL_ALGO’: None}
(Worker_TP1 pid=443) INFO 05-14 01:36:56 [p2p_nccl_engine.py:387]
ncclCommInitRank Success, 10.10.185.28:22002
10.10.185.25:21002, MyRank:1
(Worker_TP0 pid=442) INFO 05-14 01:36:56 [p2p_nccl_engine.py:387]
ncclCommInitRank Success, 10.10.185.28:22001
10.10.185.25:21001, MyRank:1
5、decode 2张A100有算力100%
6、prefill和decode超时崩溃
(Worker_TP1 pid=529) INFO 05-14 01:36:56 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {'NCCL_MAX_NCHANNELS': None, 'NCCL_MIN_NCHANNELS': None, 'NCCL_CUMEM_ENABLE': None, 'NCCL_BUFFSIZE': '4194304', 'NCCL_PROTO': None, 'NCCL_ALGO': None}
(Worker_TP0 pid=528) INFO 05-14 01:36:56 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {'NCCL_MAX_NCHANNELS': None, 'NCCL_MIN_NCHANNELS': None, 'NCCL_CUMEM_ENABLE': None, 'NCCL_BUFFSIZE': '4194304', 'NCCL_PROTO': None, 'NCCL_ALGO': None}
(Worker_TP1 pid=529) INFO 05-14 01:36:56 [p2p_nccl_engine.py:226] 🤝ncclCommInitRank Success, 10.10.185.25:21002👉10.10.185.28:22002, MyRank:0
(Worker_TP0 pid=528) INFO 05-14 01:36:56 [p2p_nccl_engine.py:226] 🤝ncclCommInitRank Success, 10.10.185.25:21001👉10.10.185.28:22001, MyRank:0
(EngineCore pid=328) INFO 05-14 01:37:56 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=328) INFO 05-14 01:38:56 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=328) INFO 05-14 01:39:56 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=328) INFO 05-14 01:40:56 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=328) INFO 05-14 01:41:56 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=328) ERROR 05-14 01:41:56 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.20.1) with config: model='/data-new/models/Qwen3-30B-A3B', speculative_config=None, tokenizer='/data-new/models/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data-new/models/Qwen3-30B-A3B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto'),
(EngineCore pid=328) ERROR 05-14 01:41:56 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-___prefill_addr_10.10.185.25:21001___decode_addr_10.10.185.28:22001_316d4c0aaca749f7bab59fbe455303b3-8dee97c6,prompt_token_ids_len=9,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-___prefill_addr_10.10.185.25:21001___decode_addr_10.10.185.28:22001_316d4c0aaca749f7bab59fbe455303b3-8dee97c6: 9}, total_num_scheduled_tokens=9, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[1], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=P2pNcclConnectorMetadata(requests=[ReqMeta(request_id='chatcmpl-___prefill_addr_10.10.185.25:21001___decode_addr_10.10.185.28:22001_316d4c0aaca749f7bab59fbe455303b3-8dee97c6', block_ids=Tensor(shape=torch.Size([1]), device=cpu,dtype=torch.int64), num_tokens=9)]), ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=328) ERROR 05-14 01:41:56 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, num_skipped_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=9.789525208026006e-05, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=9, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=9, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] Traceback (most recent call last):
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 386, in get_response
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] return next(self.gen)
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 674, in acquire_read
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 631, in timeout_ms
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] raise TimeoutError
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] TimeoutError
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138]
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] The above exception was the direct cause of the following exception:
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138]
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] Traceback (most recent call last):
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1129, in run_engine_core
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] engine_core.run_busy_loop()
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1170, in run_busy_loop
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] self._process_engine_step()
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1209, in _process_engine_step
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] outputs, model_executed = self.step_fn()
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 521, in step_with_batch_queue
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] model_output = future.result()
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] return super().result()
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] return self.__get_result()
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] raise self._exception
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] response = self.aggregate(self.get_response())
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 388, in get_response
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore pid=328) ERROR 05-14 01:41:56 [core.py:1138] TimeoutError: RPC call to sample_tokens timed out.
(Worker_TP0 pid=528) INFO 05-14 01:41:56 [multiproc_executor.py:775] Parent process exited, terminating worker queues
(APIServer pid=125) ERROR 05-14 01:41:56 [async_llm.py:704] AsyncLLM output_handler failed.
(APIServer pid=125) ERROR 05-14 01:41:56 [async_llm.py:704] Traceback (most recent call last):
(APIServer pid=125) ERROR 05-14 01:41:56 [async_llm.py:704] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 660, in output_handler
(APIServer pid=125) ERROR 05-14 01:41:56 [async_llm.py:704] outputs = await engine_core.get_output_async()
(APIServer pid=125) ERROR 05-14 01:41:56 [async_llm.py:704] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=125) ERROR 05-14 01:41:56 [async_llm.py:704] File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=125) ERROR 05-14 01:41:56 [async_llm.py:704] raise self._format_exception(outputs) from None
(APIServer pid=125) ERROR 05-14 01:41:56 [async_llm.py:704] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=125) INFO: 10.10.185.28:33892 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=125) INFO: Shutting down
(APIServer pid=125) INFO: Waiting for application shutdown.
(APIServer pid=125) INFO: Application shutdown complete.
(APIServer pid=125) INFO: Finished server process [125]