Hardware Environment
- Server model: H3C UniServer R5600 G6
- Memory: 256 GB
- GPU configuration: 2x NVIDIA A100 80GB
- Physical connection: NVLink Bridge
- Topology (`nvidia-smi topo -m`):

```
      GPU0  GPU1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV12  0-31,64-95    0              N/A
GPU1  NV12   X    0-31,64-95    0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
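Since the two GPUs report an NV12 link, it may be worth verifying that the NVLink bridge is actually healthy before digging into software. The commands below are standard NVIDIA tooling (`p2pBandwidthLatencyTest` ships with the CUDA samples and must be built from that repo); they are a diagnostic suggestion, not something already run here:

```shell
# Link state on both GPUs: all NVLink lanes should report Active
nvidia-smi nvlink --status

# Replay/CRC error counters: nonzero values suggest a bad bridge or seating
nvidia-smi nvlink -e

# From the CUDA samples: D2D bandwidth between GPU0 and GPU1 should be
# well above PCIe speeds if P2P over NVLink is actually working
./p2pBandwidthLatencyTest
```

If `p2pBandwidthLatencyTest` itself hangs or shows PCIe-level bandwidth with P2P enabled, the problem is below vLLM and NCCL.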
Software Environment and Test Matrix
All combinations fail with the same error:
- vLLM versions tested: 0.14.0, 0.15.0, 0.16.0
- CUDA Toolkit and driver combinations tested:
  - CUDA Toolkit 12.8 & NVIDIA Driver 570
  - CUDA Toolkit 12.9 & NVIDIA Driver 580
  - CUDA Toolkit 13.1 & NVIDIA Driver 590
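For reference, a launch command consistent with the engine config in the dump below (`tensor_parallel_size=2`, `served_model_name=qwen3`, `max_seq_len=32768`); the exact flags actually used are an assumption reconstructed from that dump:

```shell
vllm serve /opt/models/Qwen/Qwen2.5-0.5B-Instruct \
  --tensor-parallel-size 2 \
  --served-model-name qwen3 \
  --max-model-len 32768
```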
Error Message
```
(APIServer pid=24935) WARNING 03-07 05:00:38 [protocol.py:117] The following fields were present in the request but ignored: {'enable_thinking'}
(APIServer pid=24935) INFO 03-07 05:00:38 [qwen3coder_tool_parser.py:83] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=24935) INFO: 127.0.0.1:33842 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=24935) INFO 03-07 05:00:38 [qwen3coder_tool_parser.py:83] vLLM Successfully import tool parser Qwen3CoderToolParser !
(EngineCore_DP0 pid=25136) INFO 03-07 05:01:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:02:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:03:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) INFO 03-07 05:04:49 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0) with config: model='/opt/models/Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='/opt/models/Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None},
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-a0bcdeea6eb05f41-944f00c7'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[11668],num_output_tokens=[124]), num_scheduled_tokens={chatcmpl-a0bcdeea6eb05f41-944f00c7: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[730], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.000962526090390381, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] Traceback (most recent call last):
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 336, in get_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     status, result = mq.dequeue(
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                      ^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 616, in dequeue
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return next(self.gen)
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 536, in acquire_read
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise TimeoutError
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] TimeoutError
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] Traceback (most recent call last):
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 929, in run_engine_core
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 956, in run_busy_loop
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     self._process_engine_step()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 989, in _process_engine_step
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 487, in step_with_batch_queue
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     model_output = future.result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 80, in result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return super().result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     return self.__get_result()
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise self._exception
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 84, in wait_for_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     response = self.aggregate(get_response())
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 340, in get_response
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_DP0 pid=25136) ERROR 03-07 05:05:49 [core.py:938] TimeoutError: RPC call to sample_tokens timed out.
(Worker_TP0 pid=25334) INFO 03-07 05:05:49 [multiproc_executor.py:707] Parent process exited, terminating worker
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] AsyncLLM output_handler failed.
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] Traceback (most recent call last):
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 502, in output_handler
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]     outputs = await engine_core.get_output_async()
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]   File "/opt/miniconda3/envs/vllm_14/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546]     raise self._format_exception(outputs) from None
(APIServer pid=24935) ERROR 03-07 05:05:49 [async_llm.py:546] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker_TP1 pid=25335) INFO 03-07 05:05:49 [multiproc_executor.py:707] Parent process exited, terminating worker
(APIServer pid=24935) INFO: Shutting down
(APIServer pid=24935) INFO: Waiting for application shutdown.
(APIServer pid=24935) INFO: Application shutdown complete.
(APIServer pid=24935) INFO: Finished server process [24935]
/opt/miniconda3/envs/vllm_14/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/miniconda3/envs/vllm_14/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
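To summarize the failure chain in the log above: the request is accepted (200 OK), a TP worker then stops responding mid-request, the engine core's `mq.dequeue()` in `shm_broadcast.py` waits on shared memory, logs the 60-second warnings, and finally converts the read timeout into the fatal `RPC call to sample_tokens timed out` error. A minimal stdlib sketch of that pattern (not vLLM code; queues stand in for the shared-memory message queue, and all names are illustrative):

```python
import queue
import threading
import time

def hung_worker(requests, responses):
    """Simulates a TP worker that receives work but never replies,
    e.g. because it is stuck in a collective waiting on its peer."""
    requests.get()   # the RPC arrives...
    time.sleep(60)   # ...but the worker hangs instead of enqueuing a response

requests, responses = queue.Queue(), queue.Queue()
threading.Thread(target=hung_worker, args=(requests, responses), daemon=True).start()

requests.put("sample_tokens")
try:
    # Stands in for mq.dequeue(): the engine core waits for the worker's
    # response and eventually gives up, exactly as in the traceback.
    responses.get(timeout=0.2)
    outcome = "ok"
except queue.Empty:
    outcome = "RPC call to sample_tokens timed out."
print(outcome)
```

The point is that the timeout is only a symptom: the interesting question is why the worker process stalls, which is why the NVLink/P2P checks above matter.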
Troubleshooting Done
- P2P environment variable: set NCCL_P2P_DISABLE=1 and retested; the startup still hangs and hits the same timeout.
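Two further debugging knobs that might narrow this down (standard NCCL environment variables plus a vLLM flag that appears as `disable_custom_all_reduce=False` in the config dump above; these are suggestions, not known fixes):

```shell
# Surface NCCL's view of the hang in the server log
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,P2P

# vLLM's custom all-reduce uses CUDA P2P directly, so NCCL_P2P_DISABLE=1
# alone does not disable it; rule it out explicitly as well
vllm serve ... --disable-custom-all-reduce
```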