When using large batches, the Ray service crashes with:

ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read

Your script already follows the standard pattern for efficient vLLM batch inference, but speed and throughput can be improved further in the following ways:

1. Batch size / concurrency tuning

  • A larger batch_size gives higher throughput, but it is bounded by GPU memory and max_model_len. Increase batch_size step by step until memory usage approaches the limit without triggering OOM.
  • Set max_num_batched_tokens (e.g. 65536 or higher) and max_num_seqs (e.g. 256 or 512) to improve large-batch throughput, in line with the official tuning guidance; see the engine-construction sketch after section 2.

2. Engine parameter tuning

  • enable_chunked_prefill=True and enable_prefix_caching=True are already enabled; keep them.
  • dtype="half" (float16) or "bfloat16" noticeably improves memory utilization and speed.
  • gpu_memory_utilization can be raised somewhat (e.g. to 0.95), but stay clear of OOM.
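
A minimal sketch of how the knobs from sections 1 and 2 fit together when constructing the engine; the model path, tensor_parallel_size, and the numeric values are placeholders to be tuned for your hardware:

from vllm import LLM

llm = LLM(
    model="/path/to/your-model",      # placeholder path
    tensor_parallel_size=4,           # assumption: 4 GPUs, matching the TP0-TP3 workers in the log
    dtype="bfloat16",                 # or "half" (float16)
    gpu_memory_utilization=0.95,      # raise carefully; back off if you hit OOM
    max_model_len=8192,               # placeholder; large enough for prompt + image tokens
    max_num_batched_tokens=65536,     # large-batch throughput knob (section 1)
    max_num_seqs=256,                 # maximum concurrent sequences (section 1)
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
)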

3. Prompt/template and image insertion

  • You already pass images via multi_modal_data, which is the most efficient route.
  • Build the prompt with the chat_template multi-modal format (content is a list, images as {"type": "image"}, text as {"type": "text", "text": ...}) so the tokenizer inserts the image tokens automatically; this gives the best speed and accuracy.

4. Other suggestions

  • Avoid frequent printing and deep-copying of large objects inside the main loop to cut I/O and CPU overhead.
  • If the dataset is very large, dispatch it in chunks with multiple processes or threads to further improve utilization (see the sketch below).
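
A minimal chunked-dispatch sketch, assuming a single llm instance and a pre-built inputs list; the chunk_size of 256 is a placeholder chosen to bound per-call memory:

def generate_in_chunks(llm, inputs, sampling_params, chunk_size=256):
    # Submit fixed-size chunks so one generate() call never holds
    # the whole dataset's images and prompts at once.
    all_outputs = []
    for start in range(0, len(inputs), chunk_size):
        chunk = inputs[start:start + chunk_size]
        all_outputs.extend(llm.generate(chunk, sampling_params))
    return all_outputs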

Example: efficient batch and prompt construction

# One request per (image, question) pair; the chat template expands the
# "image" entry into the model's image placeholder token(s).
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question}
    ]}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The raw image is passed alongside the prompt via multi_modal_data.
inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})

Large-batch inference:

outputs = llm.generate(inputs, sampling_params)
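
The returned list is in the same order as inputs; a minimal sketch of collecting the generated text (attribute names follow vLLM's RequestOutput):

results = []
for output in outputs:
    # One completion per request, since n=1 in SamplingParams.
    results.append(output.outputs[0].text)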

Conclusion

  • Larger batch_size / max_num_batched_tokens / max_num_seqs means higher throughput, provided GPU memory is sufficient.
  • Use the chat_template multi-modal format for the prompt and pass images via multi_modal_data for the best efficiency and accuracy.
  • The remaining parameters (dtype, chunked_prefill, prefix_caching, etc.) are already set optimally.

Would you like a more detailed parameter-tuning script or a multi-process concurrency example?


EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [dump_input.py:76] dtype=torch.bfloat16), field=MultiModalFlatField(slices=[[slice(0, 13, None)]], dim=0)), ‘image_num_patches’: MultiModalFieldElem(modality=‘image’, key=‘image_num_patches’, data=tensor(13), field=MultiModalBatchedField()), ‘image_token_id’: MultiModalFieldElem(modality=‘image’, key=‘image_token_id’, data=tensor(151667), field=MultiModalSharedField(batch_size=1))}],mm_hashes=[‘65c9121f6232805657d49990ea112953938a8b76851408ef7f2743e6ddff902b’],mm_positions=[PlaceholderRange(offset=43, length=3330, is_embed=tensor([False, True, True, …, True, True, False]))],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=1.0, repetition_penalty=1.0, temperature=1.0, top_p=0.1, top_k=1, min_p=0.0, seed=None, stop=, stop_token_ids=, bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=, resumed_from_preemption=, new_token_ids=, new_block_ids=, num_computed_tokens=), num_scheduled_tokens={chatcmpl-573410b8caa343efb5546914bdebbcee: 2048}, total_num_scheduled_tokens=2048, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-573410b8caa343efb5546914bdebbcee: [0]}, num_common_prefix_blocks=[16], finished_req_ids=, free_encoder_mm_hashes=, structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.004708652148322501, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=3404, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] Traceback (most recent call last):
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 711, in run_engine_core
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] engine_core.run_busy_loop()
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 738, in run_busy_loop
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] self._process_engine_step()
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 764, in _process_engine_step
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 292, in step
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] raise err
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] return model_fn(scheduler_output)
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py”, line 176, in execute_model
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] (output, ) = self.collective_rpc(
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py”, line 259, in collective_rpc
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] File “/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py”, line 243, in get_response
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] raise RuntimeError(
(EngineCore_DP0 pid=22694) ERROR 10-24 07:13:00 [core.py:720] RuntimeError: Worker failed with error ‘not enough values to unpack (expected 4, got 3)’, please check the stack trace above for the root cause
(Worker_TP2 pid=22832) INFO 10-24 07:13:00 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP2 pid=22832) INFO 10-24 07:13:00 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP1 pid=22831) INFO 10-24 07:13:00 [multiproc_executor.py:546] Parent process exited, terminating worker
(APIServer pid=22426) ERROR 10-24 07:13:00 [async_llm.py:485] AsyncLLM output_handler failed.
(APIServer pid=22426) ERROR 10-24 07:13:00 [async_llm.py:485] Traceback (most recent call last):
(APIServer pid=22426) ERROR 10-24 07:13:00 [async_llm.py:485] File “/vllm-workspace/vllm/vllm/v1/engine/async_llm.py”, line 444, in output_handler
(APIServer pid=22426) ERROR 10-24 07:13:00 [async_llm.py:485] outputs = await engine_core.get_output_async()
(APIServer pid=22426) ERROR 10-24 07:13:00 [async_llm.py:485] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=22426) ERROR 10-24 07:13:00 [async_llm.py:485] File “/vllm-workspace/vllm/vllm/v1/engine/core_client.py”, line 845, in get_output_async
(APIServer pid=22426) ERROR 10-24 07:13:00 [async_llm.py:485] raise self._format_exception(outputs) from None
(APIServer pid=22426) ERROR 10-24 07:13:00 [async_llm.py:485] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker_TP3 pid=22833) INFO 10-24 07:13:00 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP0 pid=22830) INFO 10-24 07:13:00 [multiproc_executor.py:546] Parent process exited, terminating worker
(APIServer pid=22426) INFO: 10.58.19.146:38158 - “POST /v1/chat/completions HTTP/1.1” 500 Internal Server Error
(EngineCore_DP0 pid=22694) Process EngineCore_DP0:
(EngineCore_DP0 pid=22694) Traceback (most recent call last):
(EngineCore_DP0 pid=22694) File “/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore_DP0 pid=22694) self.run()
(EngineCore_DP0 pid=22694) File “/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py”, line 108, in run
(EngineCore_DP0 pid=22694) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 722, in run_engine_core
(EngineCore_DP0 pid=22694) raise e
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 711, in run_engine_core
(EngineCore_DP0 pid=22694) engine_core.run_busy_loop()
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 738, in run_busy_loop
(EngineCore_DP0 pid=22694) self._process_engine_step()
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 764, in _process_engine_step
(EngineCore_DP0 pid=22694) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=22694) ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 292, in step
(EngineCore_DP0 pid=22694) model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=22694) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=22694) raise err
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/engine/core.py”, line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=22694) return model_fn(scheduler_output)
(EngineCore_DP0 pid=22694) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py”, line 176, in execute_model
(EngineCore_DP0 pid=22694) (output, ) = self.collective_rpc(
(EngineCore_DP0 pid=22694) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py”, line 259, in collective_rpc
(EngineCore_DP0 pid=22694) result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=22694) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=22694) File “/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py”, line 243, in get_response
(EngineCore_DP0 pid=22694) raise RuntimeError(
(EngineCore_DP0 pid=22694) RuntimeError: Worker failed with error ‘not enough values to unpack (expected 4, got 3)’, please check the stack trace above for the root cause