No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

Hardware Environment

  • Server model: H3C UniServer R5600 G6

  • Memory: 256 GB

  • GPUs: 2× NVIDIA A100 80GB

  • Physical link: connected via NVLink Bridge

  • Topology (nvidia-smi topo -m):

nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    0-31,64-95      0               N/A
GPU1    NV12     X      0-31,64-95      0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
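Given the NV12 link shown above, it may be worth confirming that peer-to-peer access actually works outside of vLLM. A quick check (assuming a PyTorch build matching the serving environment is installed):

```shell
# Verify the NVLink links are up on both GPUs
nvidia-smi nvlink --status

# Ask CUDA (via PyTorch) whether each GPU can directly access the other;
# both calls should print True on a healthy NVLink pair
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1), torch.cuda.can_device_access_peer(1, 0))"
```

If either check fails here, the problem is below vLLM (driver/fabric level) rather than in the engine itself.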

Software Environment & Test Matrix

Every combination produces the same error:

  • vLLM versions tested: 0.14.0, 0.15.0, 0.16.0

  • CUDA Toolkit / NVIDIA driver combinations tested:

    1. CUDA Toolkit 12.8 & NVIDIA Driver 570

    2. CUDA Toolkit 12.9 & NVIDIA Driver 580

    3. CUDA Toolkit 13.1 & NVIDIA Driver 590
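For reference, the failing launch can be approximated from the engine config dumped in the error below (model path, tensor-parallel size, served model name, and max sequence length are taken from that dump; any other flags used are unknown):

```shell
vllm serve /opt/models/Qwen/Qwen2.5-0.5B-Instruct \
    --served-model-name qwen3 \
    --tensor-parallel-size 2 \
    --max-model-len 32768
```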

Error Message

(APIServer pid=24935) WARNING 03-07 05:00:38 [protocol.py:117] The following fields were present in the request but ignored: {'enable_thinking'}
(APIServer pid=24935) INFO 03-07 05:00:38 [qwen3coder_tool_parser.py:83] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=24935) INFO:     127.0.0.1:33842 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=24935) INFO 03-07 05:00:38 [qwen3coder_tool_parser.py:83] vLLM Successfully import tool parser Qwen3CoderToolParser !
(EngineCore_DP0 pid=25136) INFO 03-07 05:01:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:02:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:03:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:04:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0) with config: model='/opt/models/Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='/opt/models/Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-a0bcdeea6eb05f41-944f00c7'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[11668],num_output_tokens=[124]), num_scheduled_tokens={chatcmpl-a0bcdeea6eb05f41-944f00c7: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[730], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.000962526090390381, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] Traceback (most recent call last):
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 336, in get_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     status, result = mq.dequeue(
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                      ^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 616, in dequeue
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return next(self.gen)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 536, in acquire_read
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise TimeoutError
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] TimeoutError
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] 
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] 
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] Traceback (most recent call last):
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 929, in run_engine_core
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 956, in run_busy_loop
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     self._process_engine_step()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 989, in _process_engine_step
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 487, in step_with_batch_queue
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     model_output = future.result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 80, in result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return super().result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return self.__get_result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise self._exception
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 84, in wait_for_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     response = self.aggregate(get_response())
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 340, in get_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] TimeoutError: RPC call to sample_tokens timed out.
(Worker_TP0 pid=25334) INFO 03-07 05:05:49 [multiproc_executor.py:707] Parent process exited, terminating worker
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] AsyncLLM output_handler failed.
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] Traceback (most recent call last):
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 502, in output_handler
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]     outputs = await engine_core.get_output_async()
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]     raise self._format_exception(outputs) from None
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker_TP1 pid=25335) INFO 03-07 05:05:49 [multiproc_executor.py:707] Parent process exited, terminating worker
(APIServer pid=24935) INFO:     Shutting down
(APIServer pid=24935) INFO:     Waiting for application shutdown.
(APIServer pid=24935) INFO:     Application shutdown complete.
(APIServer pid=24935) INFO:     Finished server process [24935]
/opt/miniconda3/envs/vllm_14/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/miniconda3/envs/vllm_14/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Troubleshooting Done

  • P2P environment variable: tested with NCCL_P2P_DISABLE=1; the timeout still occurred.
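Beyond NCCL_P2P_DISABLE, a few standard debug switches may reveal where the workers hang on the next run (all are documented NCCL/vLLM/CUDA variables; CUDA_LAUNCH_BLOCKING in particular slows execution and is for debugging only):

```shell
export NCCL_DEBUG=INFO           # per-rank NCCL topology and transport decisions
export VLLM_LOGGING_LEVEL=DEBUG  # verbose vLLM engine/worker logs
export CUDA_LAUNCH_BLOCKING=1    # make kernel launches synchronous to localize a hang
```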

Your error "TimeoutError: RPC call to sample_tokens timed out.", accompanied by "No available shared memory broadcast block found in 60 seconds", is a common distributed-communication / shared-memory timeout in multi-GPU vLLM inference. It is not specific to your environment combination (NVLink-connected A100s, CUDA 12.8/12.9/13.1, vLLM 0.14–0.16); it is a known issue in vLLM's own distributed communication machinery, and it reproduces most readily when time-consuming work such as model compilation or weight/KV-cache quantization is in progress.
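The mechanism behind the repeated 60-second messages can be illustrated with a small self-contained sketch (class and names are hypothetical simplifications; the real logic lives in vllm/distributed/device_communicators/shm_broadcast.py): a reader polls a ring of shared-memory blocks, logs a warning at a fixed cadence while nothing arrives, and eventually raises TimeoutError, which the executor surfaces as the "RPC call ... timed out" error above.

```python
import time

class ShmBroadcastSketch:
    """Toy model of the reader side of a shared-memory broadcast queue.

    Illustrative only: the actual vLLM implementation is in
    vllm/distributed/device_communicators/shm_broadcast.py.
    """

    def __init__(self, warn_interval=60.0, timeout=300.0):
        self.warn_interval = warn_interval  # cadence of the "60 seconds" warning
        self.timeout = timeout              # hard limit before TimeoutError
        self.blocks = []                    # blocks the producer has published

    def acquire_read(self):
        start = last_warn = time.monotonic()
        while True:
            if self.blocks:                 # a broadcast block became available
                return self.blocks.pop(0)
            now = time.monotonic()
            if now - last_warn >= self.warn_interval:
                print(f"No available shared memory broadcast block found "
                      f"in {self.warn_interval:.0f} seconds.")
                last_warn = now
            if now - start >= self.timeout:
                # the executor converts this into "RPC call ... timed out"
                raise TimeoutError
            time.sleep(0.001)

# Fast path: the producer published a block, so the reader returns immediately.
q = ShmBroadcastSketch()
q.blocks.append(b"model-output")
print(q.acquire_read())  # b'model-output'
```

In your logs the producer side (the workers) never publishes a block, so the reader warns every 60 seconds and finally times out; the question is why the workers stall, which the suggestions below address.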

Based on the official guidance, the main directions for troubleshooting are:

  1. Increase the timeout: set the environment variable VLLM_RPC_TIMEOUT to a larger value (e.g. export VLLM_RPC_TIMEOUT=300 or higher; note that some vLLM releases interpret this value in milliseconds, so verify the unit for your version) to avoid hangs or communication failures caused by the default 60-second timeout;
  2. Upgrade vLLM: versions up to and including 0.13.0 have known stability issues with prefill/decode (P/D) disaggregation; upgrading to 0.16.x or later is recommended;
  3. Keep configuration consistent: with multiple GPUs/processes, all workers must use identical compilation parameters and --kv-transfer-config settings (especially compile_ranges_split_points); otherwise the decode node times out waiting for KV blocks in the expected format;
  4. Check CPU load: JIT compilation or FlashInfer kernel builds need sufficient CPU; an overloaded CPU slows the worker processes and indirectly triggers the timeout;
  5. NCCL/driver issues: certain NCCL or driver bugs can also break communication; try upgrading NCCL or the driver, or work around them with settings such as NCCL_CUMEM_HOST_ENABLE=0.
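Suggestions 1 and 5 above translate to something like the following before starting the server (as noted, depending on the vLLM release VLLM_RPC_TIMEOUT may be interpreted in milliseconds rather than seconds, so verify the unit against your version):

```shell
export VLLM_RPC_TIMEOUT=300000   # generous RPC timeout (check the unit for your version)
export NCCL_CUMEM_HOST_ENABLE=0  # workaround for known NCCL host-memory registration issues
```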


Do you need more detailed troubleshooting steps or configuration examples?
