Qwen2.5-VL fails to start with flashinfer enabled

Qwen2.5-VL fails to start with the flashinfer backend enabled. The launch command is as follows:

VLLM_ATTENTION_BACKEND=FLASHINFER \
TMPDIR=/msswift_tmp/tmp \
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
  --data-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --port 9014 \
  --no-enable-prefix-caching \
  --enable-chunked-prefill \
  --distributed-executor-backend ray \
  --block-size 16 \
  --dtype bfloat16 \
  --kv-cache-dtype auto \
  --max-num-seqs 128 \
  --load-format safetensors

Based on the available sources, when Qwen2.5-VL-3B-Instruct uses FLASHINFER as the attention backend in vLLM, startup fails if the flashinfer library is not installed correctly, with errors such as "'NoneType' object is not callable" or "Engine process failed to start". flashinfer must be installed manually, and its version must be compatible with your PyTorch and CUDA environment; otherwise similar problems occur. In addition, on some models and hardware the flashinfer backend may have compatibility or accuracy issues. We recommend first confirming that flashinfer is installed correctly, then switching to another attention backend (e.g. FLASH_ATTN) to narrow down the source of the problem.

If flashinfer is already installed but startup still fails, check that your flashinfer version, CUDA driver, and PyTorch version match, and consult the installation notes in the related issues. To localize the problem further, please share the full error log. Would you like more detailed troubleshooting steps?
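As a first troubleshooting step, a minimal check for whether flashinfer is importable in the current environment can look like this (a sketch; it assumes the package is installed under the import name `flashinfer` and that recent releases expose `__version__`):

```python
import importlib.util

def flashinfer_available() -> bool:
    # True when the flashinfer package can be found on the current Python path.
    return importlib.util.find_spec("flashinfer") is not None

if flashinfer_available():
    import flashinfer
    # Guard __version__ in case an older build does not define it.
    print("flashinfer version:", getattr(flashinfer, "__version__", "unknown"))
else:
    print("flashinfer is not installed in this environment")
```

Run this inside the same conda environment you use for `vllm serve`, so you are checking the interpreter vLLM actually sees.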


Hint: Mention @RunLLM in the post for followups.

@RunLLM
The error log is as follows:

2025-11-24 16:11:04,629 E 561378 590379] core_worker_process.cc:837: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 24x across cluster]
(EngineCore_DP7 pid=558751) (RayWorkerWrapper pid=561301) The image processor of type Qwen2VLImageProcessor is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with use_fast=False. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] EngineCore failed to start.
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] Traceback (most recent call last):
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 829, in run_engine_core
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1124, in __init__
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] super().__init__(
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] super().__init__(
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] self.model_executor = executor_class(vllm_config)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] self._init_executor()
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/executor/ray_executor.py", line 97, in _init_executor
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] self._init_workers_ray(placement_group)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/executor/ray_executor.py", line 371, in _init_workers_ray
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] self.collective_rpc("load_model")
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/executor/ray_executor.py", line 493, in collective_rpc
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] return fn(*args, **kwargs)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] return func(*args, **kwargs)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/ray/_private/worker.py", line 2972, in get
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] values, debugger_breakpoint = worker.get_objects(
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/ray/_private/worker.py", line 1031, in get_objects
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] raise value.as_instanceof_cause()
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=561355, ip=10.168.1.19, actor_id=b88510a5e87f5580878dafb701000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7f91016e5ab0>)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 343, in execute_method
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] raise e
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_method
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] return run_method(self, method, args, kwargs)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] return func(*args, **kwargs)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in load_model
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] self.model = model_loader.load_model(
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] model = initialize_model(
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 55, in initialize_model
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1237, in __init__
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] self.visual = Qwen2_5_VisionTransformer(
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] File "/mnt/data/cpfs/miniconda3/envs/flashinfer/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 692, in __init__
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] raise RuntimeError(
(EngineCore_DP2 pid=558006) ERROR 11-24 16:11:15 [core.py:842] RuntimeError: Qwen2.5-VL does not support AttentionBackendEnum.FLASHINFER backend now.
(EngineCore_DP2 pid=558006) Process EngineCore_DP2:
(EngineCore_DP2 pid=558006) [same traceback as above, repeated by the multiprocessing bootstrap; it also ends with:]
(EngineCore_DP2 pid=558006) RuntimeError: Qwen2.5-VL does not support AttentionBackendEnum.FLASHINFER backend now.
(EngineCore_DP2 pid=558006) INFO 11-24 16:11:15 [ray_executor.py:121] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(EngineCore_DP2 pid=558006) (RayWorkerWrapper pid=561355) ERROR 11-24 16:11:15 [worker_base.py:342] Error executing method "load_model". This might cause deadlock in distributed execution.
(EngineCore_DP2 pid=558006) (RayWorkerWrapper pid=561355) ERROR 11-24 16:11:15 [worker_base.py:342] [same traceback as above, ending with:]
(EngineCore_DP2 pid=558006) (RayWorkerWrapper pid=561355) ERROR 11-24 16:11:15 [worker_base.py:342] RuntimeError: Qwen2.5-VL does not support AttentionBackendEnum.FLASHINFER backend now.
(EngineCore_DP6 pid=558603) (RayWorkerWrapper pid=561740) Downloading Model from https://www.modelscope.cn to directory: /mnt/workspace/.cache/modelscope/models/Qwen/Qwen2.5-VL-3B-Instruct
(EngineCore_DP2 pid=558006) (EngineCore_DP6 pid=558603) (RayWorkerWrapper pid=561740) 2025-11-24 16:11:15,559 - modelscope - INFO - Target directory already exists, skipping creation.
(EngineCore_DP6 pid=558603) (pid=561375) [2025-11-24 16:11:04,626 E 561375 590323] core_worker_process.cc:837: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 7x across cluster]
(EngineCore_DP2 pid=558006) (RayWorkerWrapper pid=561355) [rank2]:[W1124 16:11:15.215572426 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Distributed communication package - torch.distributed — PyTorch 2.9 documentation (function operator())
(EngineCore_DP6 pid=558603) (RayWorkerWrapper pid=561740) The image processor of type Qwen2VLImageProcessor is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with use_fast=False. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP1 pid=557850) (RayWorkerWrapper pid=561366) Downloading Model from https://www.modelscope.cn to directory: /mnt/workspace/.cache/modelscope/models/Qwen/Qwen2.5-VL-3B-Instruct
(EngineCore_DP3 pid=558153) (RayWorkerWrapper pid=561454) dsw-643752-b77cb64b8-b69sf:561454:591574 [0] NCCL INFO [Service thread] Connection closed by localRank 2
(EngineCore_DP3 pid=558153) (RayWorkerWrapper pid=561454) dsw-643752-b77cb64b8-b69sf:561454:591502 [0] NCCL INFO [Service thread] Connection closed by localRank 2
(EngineCore_DP1 pid=557850) (RayWorkerWrapper pid=561366) 2025-11-24 16:11:16,439 - modelscope - INFO - Target directory already exists, skipping creation.
(EngineCore_DP1 pid=557850) (pid=561443) [2025-11-24 16:11:04,676 E 561443 590620] core_worker_process.cc:837: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 26x across cluster]
(EngineCore_DP1 pid=557850) (RayWorkerWrapper pid=561366) dsw-643752-b77cb64b8-b69sf:561366:591562 [0] NCCL INFO [Service thread] Connection closed by localRank 2
(EngineCore_DP1 pid=557850) (RayWorkerWrapper pid=561366) dsw-643752-b77cb64b8-b69sf:561366:591510 [0] NCCL INFO [Service thread] Connection closed by localRank 2
(EngineCore_DP0 pid=557578) (RayWorkerWrapper pid=561527) dsw-643752-b77cb64b8-b69sf:561527:591566 [0] NCCL INFO [Service thread] Connection closed by localRank 2
(EngineCore_DP0 pid=557578) (RayWorkerWrapper pid=561527) dsw-643752-b77cb64b8-b69sf:561527:591496 [0] NCCL INFO [Service thread] Connection closed by localRank 2
(EngineCore_DP4 pid=558301) ERROR 11-24 16:11:16 [core.py:842] EngineCore failed to start.
(EngineCore_DP4 pid=558301) [same traceback as EngineCore_DP2 above, interleaved with output from RayWorkerWrapper pid=561265; both end with:]
(EngineCore_DP4 pid=558301) ERROR 11-24 16:11:16 [core.py:842] RuntimeError: Qwen2.5-VL does not support AttentionBackendEnum.FLASHINFER backend now.

According to the log, Qwen2.5-VL currently does not support AttentionBackendEnum.FLASHINFER (the flashinfer backend): vLLM raises "Qwen2.5-VL does not support AttentionBackendEnum.FLASHINFER backend now." outright, so the model cannot be loaded or served. This is an explicit restriction in vLLM's own code; it cannot be worked around by installation or configuration, and you must switch to a different attention backend.
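The failure comes from a hard-coded guard in the vision tower's constructor (qwen2_5_vl.py in the traceback). Conceptually it behaves like the following sketch; this is illustrative only, not vLLM's actual source, and the enum values and the `check_vision_backend` helper are assumptions for the example:

```python
from enum import Enum

class AttentionBackendEnum(Enum):
    FLASH_ATTN = "AttentionBackendEnum.FLASH_ATTN"
    FLASHINFER = "AttentionBackendEnum.FLASHINFER"
    XFORMERS = "AttentionBackendEnum.XFORMERS"

def check_vision_backend(backend: AttentionBackendEnum) -> None:
    # Hypothetical guard mirroring the check in the vision transformer:
    # the vision tower only accepts backends it has attention kernels for,
    # so an unsupported backend fails fast at model-load time.
    supported = {AttentionBackendEnum.FLASH_ATTN, AttentionBackendEnum.XFORMERS}
    if backend not in supported:
        raise RuntimeError(
            f"Qwen2.5-VL does not support {backend.value} backend now."
        )

check_vision_backend(AttentionBackendEnum.FLASH_ATTN)  # passes silently
```

Because the check runs inside `load_model` on every worker, each Ray worker raises the same RuntimeError, which is why the traceback repeats across the DP ranks.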

We recommend setting the VLLM_ATTENTION_BACKEND environment variable to FLASH_ATTN or XFORMERS, for example:

export VLLM_ATTENTION_BACKEND=FLASH_ATTN

Then restart the service. Would you like further troubleshooting or a more detailed explanation of the cause?


@RunLLM
Currently, which models support FLASHINFER?

Currently, the models in vLLM that support the FLASHINFER backend mainly include Llama, Llama2, Llama3, Mistral, Mixtral, Qwen2, Qwen2-VL, Deepseek-V2, Gemma, Gemma2, Phi-3, InternLM, InternLM2, Starcoder2, GPT2, GPTBigCode, RW, OPT, MPT, Baichuan, Falcon, GPTNeoX, GPTJ, Bloom, MiniCPM, Olmo, Olmo2, Olmoe, Jamba, Arctic, Aquila, Phi3V, Phi3Small, StableLm, StableLMEpoch, Solar, Persimmon, Orion, DeciLM, Dbrx, Exaone, Nemotron, Granite, GraniteMoe, FalconMamba, PhiMoE, Qwen2Moe, Qwen2ForRewardModel, Qwen2ForSequenceClassification, Bert, Roberta, XLMRoberta, Florence2, Bart, Blip2, Chameleon, ChatGLM, Fuyu, H2OVL, InternVL, Idefics3, Llava, LlavaNext, LlavaOnevision, Mamba, Medusa, MLPSpeculator, NVLM_D, PaliGemma, Pixtral, Ultravox, EAGLE, Molmo, Mllama, Aria, MiniCPMV, Qwen2Audio, Qwen2VL, Qwen2_5_VL, Qwen2_5OmniThinker, Qwen3VL, Qwen3VLMoe, Qwen3OmniMoeThinker, RForConditionalGeneration, SkyworkR1VChatModel, SmolVLMForConditionalGeneration, Step3VLForConditionalGeneration, TarsierForConditionalGeneration, Tarsier2ForConditionalGeneration, Emu3ForConditionalGeneration, among others. The exact support depends on the model architecture and the vLLM version; some multimodal models, such as Qwen2.5-VL, do not currently support the FLASHINFER backend and will fail with an error.

See the official documentation for the full support list. Would you like details on a specific model's support status?
