No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

Hardware Environment

  • Server model: H3C UniServer R5600 G6

  • Memory: 256 GB

  • GPUs: 2× NVIDIA A100 80GB

  • Physical link: connected via NVLink Bridge

  • Topology (nvidia-smi topo -m):

nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    0-31,64-95      0               N/A
GPU1    NV12     X      0-31,64-95      0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
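Given the NV12 link shown above, it may be worth confirming that peer-to-peer access actually works outside of vLLM. A quick check (assuming a PyTorch build matching the serving environment is installed):

```shell
# Verify the NVLink links are up on both GPUs
nvidia-smi nvlink --status

# Ask CUDA (via PyTorch) whether each GPU can directly access the other;
# both calls should print True on a healthy NVLink pair
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1), torch.cuda.can_device_access_peer(1, 0))"
```

If either check fails here, the problem is below vLLM (driver/fabric level) rather than in the engine itself.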

Software Environment & Test Matrix

Every combination produces the same error:

  • vLLM versions tested: 0.14.0, 0.15.0, 0.16.0

  • CUDA Toolkit / NVIDIA driver combinations tested:

    1. CUDA Toolkit 12.8 & NVIDIA Driver 570

    2. CUDA Toolkit 12.9 & NVIDIA Driver 580

    3. CUDA Toolkit 13.1 & NVIDIA Driver 590
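For reference, the failing launch can be approximated from the engine config dumped in the error below (model path, tensor-parallel size, served model name, and max sequence length are taken from that dump; any other flags used are unknown):

```shell
vllm serve /opt/models/Qwen/Qwen2.5-0.5B-Instruct \
    --served-model-name qwen3 \
    --tensor-parallel-size 2 \
    --max-model-len 32768
```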

Error Message

(APIServer pid=24935) WARNING 03-07 05:00:38 [protocol.py:117] The following fields were present in the request but ignored: {'enable_thinking'}
(APIServer pid=24935) INFO 03-07 05:00:38 [qwen3coder_tool_parser.py:83] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=24935) INFO:     127.0.0.1:33842 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=24935) INFO 03-07 05:00:38 [qwen3coder_tool_parser.py:83] vLLM Successfully import tool parser Qwen3CoderToolParser !
(EngineCore_DP0 pid=25136) INFO 03-07 05:01:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:02:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:03:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:04:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0) with config: model='/opt/models/Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='/opt/models/Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-a0bcdeea6eb05f41-944f00c7'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[11668],num_output_tokens=[124]), num_scheduled_tokens={chatcmpl-a0bcdeea6eb05f41-944f00c7: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[730], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.000962526090390381, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] Traceback (most recent call last):
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 336, in get_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     status, result = mq.dequeue(
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                      ^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 616, in dequeue
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return next(self.gen)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 536, in acquire_read
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise TimeoutError
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] TimeoutError
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] 
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] 
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] Traceback (most recent call last):
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 929, in run_engine_core
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 956, in run_busy_loop
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     self._process_engine_step()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 989, in _process_engine_step
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 487, in step_with_batch_queue
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     model_output = future.result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 80, in result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return super().result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return self.__get_result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise self._exception
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 84, in wait_for_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     response = self.aggregate(get_response())
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 340, in get_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] TimeoutError: RPC call to sample_tokens timed out.
(Worker_TP0 pid=25334) INFO 03-07 05:05:49 [multiproc_executor.py:707] Parent process exited, terminating worker
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] AsyncLLM output_handler failed.
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] Traceback (most recent call last):
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 502, in output_handler
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]     outputs = await engine_core.get_output_async()
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]     raise self._format_exception(outputs) from None
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker_TP1 pid=25335) INFO 03-07 05:05:49 [multiproc_executor.py:707] Parent process exited, terminating worker
(APIServer pid=24935) INFO:     Shutting down
(APIServer pid=24935) INFO:     Waiting for application shutdown.
(APIServer pid=24935) INFO:     Application shutdown complete.
(APIServer pid=24935) INFO:     Finished server process [24935]
/opt/miniconda3/envs/vllm_14/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/miniconda3/envs/vllm_14/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Troubleshooting Done

  • P2P environment variable: tested with NCCL_P2P_DISABLE=1; the timeout still occurred.
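Beyond NCCL_P2P_DISABLE, a few standard debug switches may reveal where the workers hang on the next run (all are documented NCCL/vLLM/CUDA variables; CUDA_LAUNCH_BLOCKING in particular slows execution and is for debugging only):

```shell
export NCCL_DEBUG=INFO           # per-rank NCCL topology and transport decisions
export VLLM_LOGGING_LEVEL=DEBUG  # verbose vLLM engine/worker logs
export CUDA_LAUNCH_BLOCKING=1    # make kernel launches synchronous to localize a hang
```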

Your error "TimeoutError: RPC call to sample_tokens timed out.", accompanied by "No available shared memory broadcast block found in 60 seconds", is a common distributed-communication / shared-memory timeout in multi-GPU vLLM inference. It is not specific to your environment combination (NVLink-connected A100s, CUDA 12.8/12.9/13.1, vLLM 0.14–0.16); it is a known issue in vLLM's own distributed communication machinery, and it reproduces most readily when time-consuming work such as model compilation or weight/KV-cache quantization is in progress.
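The mechanism behind the repeated 60-second messages can be illustrated with a small self-contained sketch (class and names are hypothetical simplifications; the real logic lives in vllm/distributed/device_communicators/shm_broadcast.py): a reader polls a ring of shared-memory blocks, logs a warning at a fixed cadence while nothing arrives, and eventually raises TimeoutError, which the executor surfaces as the "RPC call ... timed out" error above.

```python
import time

class ShmBroadcastSketch:
    """Toy model of the reader side of a shared-memory broadcast queue.

    Illustrative only: the actual vLLM implementation is in
    vllm/distributed/device_communicators/shm_broadcast.py.
    """

    def __init__(self, warn_interval=60.0, timeout=300.0):
        self.warn_interval = warn_interval  # cadence of the "60 seconds" warning
        self.timeout = timeout              # hard limit before TimeoutError
        self.blocks = []                    # blocks the producer has published

    def acquire_read(self):
        start = last_warn = time.monotonic()
        while True:
            if self.blocks:                 # a broadcast block became available
                return self.blocks.pop(0)
            now = time.monotonic()
            if now - last_warn >= self.warn_interval:
                print(f"No available shared memory broadcast block found "
                      f"in {self.warn_interval:.0f} seconds.")
                last_warn = now
            if now - start >= self.timeout:
                # the executor converts this into "RPC call ... timed out"
                raise TimeoutError
            time.sleep(0.001)

# Fast path: the producer published a block, so the reader returns immediately.
q = ShmBroadcastSketch()
q.blocks.append(b"model-output")
print(q.acquire_read())  # b'model-output'
```

In your logs the producer side (the workers) never publishes a block, so the reader warns every 60 seconds and finally times out; the question is why the workers stall, which the suggestions below address.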

Based on the official guidance, the main directions for troubleshooting are:

  1. Increase the timeout: set the environment variable VLLM_RPC_TIMEOUT to a larger value (e.g. export VLLM_RPC_TIMEOUT=300 or higher; note that some vLLM releases interpret this value in milliseconds, so verify the unit for your version) to avoid hangs or communication failures caused by the default 60-second timeout;
  2. Upgrade vLLM: versions up to and including 0.13.0 have known stability issues with prefill/decode (P/D) disaggregation; upgrading to 0.16.x or later is recommended;
  3. Keep configuration consistent: with multiple GPUs/processes, all workers must use identical compilation parameters and --kv-transfer-config settings (especially compile_ranges_split_points); otherwise the decode node times out waiting for KV blocks in the expected format;
  4. Check CPU load: JIT compilation or FlashInfer kernel builds need sufficient CPU; an overloaded CPU slows the worker processes and indirectly triggers the timeout;
  5. NCCL/driver issues: certain NCCL or driver bugs can also break communication; try upgrading NCCL or the driver, or work around them with settings such as NCCL_CUMEM_HOST_ENABLE=0.
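Suggestions 1 and 5 above translate to something like the following before starting the server (as noted, depending on the vLLM release VLLM_RPC_TIMEOUT may be interpreted in milliseconds rather than seconds, so verify the unit against your version):

```shell
export VLLM_RPC_TIMEOUT=300000   # generous RPC timeout (check the unit for your version)
export NCCL_CUMEM_HOST_ENABLE=0  # workaround for known NCCL host-memory registration issues
```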


Do you need more detailed troubleshooting steps or configuration examples?
