Setup: 2 nodes with 16 GPUs in total (8 GPUs per node), trying to start inference for Qwen3-235B-A22B.
The cluster looks like this:
root@root-1-241:/data/llm/app/Qwen3-235B-A22B# ray status
======== Autoscaler status: 2026-01-23 15:03:17.631346 ========
Node status
---------------------------------------------------------------
Active:
1 node_60b782f567508b2a8c54c54cb54bb6f6c4e68e62cbe366664003069e
1 node_11657f380647113b616b9b934e410a3daab898575243e4c96a7a02cf
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/256.0 CPU
0.0/16.0 GPU
0B/1.78TiB memory
0B/190.00GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands)
root@root-1-241:/data/llm/app/Qwen3-235B-A22B#
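For reference, the numbers that matter in the status output above are the active node count and the GPU total. Here is a minimal sketch for pulling them out of `ray status` text. `parse_ray_status` is a hypothetical helper written for this post, not part of Ray, and the regexes assume the plain-text layout shown above.

```python
import re

# Excerpt of the `ray status` output from this post.
SAMPLE = """\
Active:
 1 node_60b782f567508b2a8c54c54cb54bb6f6c4e68e62cbe366664003069e
 1 node_11657f380647113b616b9b934e410a3daab898575243e4c96a7a02cf
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
Total Usage:
 0.0/256.0 CPU
 0.0/16.0 GPU
 0B/1.78TiB memory
"""


def parse_ray_status(text: str) -> dict:
    """Hypothetical helper: count active nodes and read CPU/GPU totals."""
    nodes = 0
    totals = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        # Track which part of the "Node status" block we are in.
        if line in ("Active:", "Pending:", "Recent failures:"):
            section = line.rstrip(":")
            continue
        if section == "Active":
            m = re.match(r"(\d+)\s+node_", line)
            if m:
                nodes += int(m.group(1))
        # Resource lines look like "<used>/<total> CPU" or "<used>/<total> GPU".
        m = re.match(r"[\d.]+/([\d.]+)\s+(CPU|GPU)$", line)
        if m:
            totals[m.group(2)] = float(m.group(1))
    return {"active_nodes": nodes, "total": totals}


summary = parse_ray_status(SAMPLE)
print(summary)
# GPUs per node, assuming the GPUs are spread evenly across nodes.
print(summary["total"]["GPU"] / summary["active_nodes"])
```

For the output in this post that gives 2 active nodes and 16 GPUs, i.e. 8 per node if they are evenly distributed.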
The serve command is:
nohup vllm serve /data/llm/models/Qwen3-235B-A22B -pp 2 -tp 8 --trust-remote-code --distributed-executor-backend ray --dtype bfloat16 --swap-space 16 --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 11235 --max-model-len 20000 --reasoning-parser qwen3 --tool-call-parser hermes --enable-auto-tool-choice > vllm-$(date +%Y%m%d%H%M).log 2>&1 &
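As a quick sanity check on the command above: `-tp 8` together with `-pp 2` needs 8 × 2 = 16 GPUs, which matches the 16 GPUs that `ray status` reports. A small sketch of that arithmetic follows. `check_layout` is a hypothetical helper for illustration, not a vLLM API, and it only compares counts. It says nothing about how devices are exposed inside each worker process.

```python
def check_layout(tp: int, pp: int, total_gpus: int, gpus_per_node: int) -> list[str]:
    """Hypothetical helper: flag obvious size mismatches in a TP/PP layout."""
    problems = []
    world_size = tp * pp  # one worker (GPU) per (pipeline stage, tensor rank) pair
    if world_size > total_gpus:
        problems.append(
            f"layout needs {world_size} GPUs but the cluster has {total_gpus}"
        )
    if tp > gpus_per_node:
        problems.append(
            f"tensor-parallel size {tp} exceeds {gpus_per_node} GPUs per node, "
            "so a TP group would span nodes"
        )
    return problems


# Values from this post: -tp 8 -pp 2 on 2 nodes with 8 GPUs each.
print(check_layout(tp=8, pp=2, total_gpus=16, gpus_per_node=8))
```

With the values from this post the list comes back empty, so the parallel sizes themselves are consistent with the cluster.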
It fails with the following error:
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] EngineCore failed to start.
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] Traceback (most recent call last):
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 833, in run_engine_core
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 606, in __init__
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] super().__init__(
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] self._init_executor()
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/ray_executor.py", line 97, in _init_executor
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] self._init_workers_ray(placement_group)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/ray_executor.py", line 370, in _init_workers_ray
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] self.collective_rpc("init_device")
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/ray_executor.py", line 493, in collective_rpc
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] return fn(*args, **kwargs)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] return func(*args, **kwargs)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 2858, in get
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 958, in get_objects
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] raise value.as_instanceof_cause()
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ray.exceptions.RayTaskError(AcceleratorError): ray::RayWorkerWrapper.execute_method() (pid=974, ip=192.168.205.25, actor_id=d3849d9ac13f4515a54d8f8502000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7faa41db23f0>)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 343, in execute_method
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] raise e
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_method
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] return func(*args, **kwargs)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 324, in init_device
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 216, in init_device
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] current_platform.set_device(self.device)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/vllm_metax/platform.py", line 148, in set_device
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] torch.cuda.set_device(device)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] File "/opt/conda/lib/python3.12/site-packages/torch/cuda/__init__.py", line 570, in set_device
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] torch._C._cuda_setDevice(device)
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] torch.AcceleratorError: CUDA error: invalid device ordinal
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=9433) ERROR 01-23 14:46:01 [core.py:842] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
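For context on the final error: "invalid device ordinal" is what the CUDA runtime reports when the index passed to `set_device` is outside the range of devices the process can see, that is, not in `0 .. device_count - 1`. The sketch below models that rule in plain Python. Both functions are hypothetical helpers for illustration, and they assume standard `CUDA_VISIBLE_DEVICES` semantics. The traceback goes through `vllm_metax`, and that backend may use a different variable or numbering, so treat this as a way to reason about the error rather than a diagnosis.

```python
from typing import Optional


def visible_device_count(cuda_visible_devices: Optional[str], physical_count: int) -> int:
    """Hypothetical helper: how many devices a process sees.

    Simplification: every listed ID is assumed valid. The real runtime
    stops at the first invalid entry.
    """
    if cuda_visible_devices is None:
        # Variable unset: all physical devices are visible.
        return physical_count
    ids = [s for s in cuda_visible_devices.split(",") if s.strip()]
    return len(ids)


def is_valid_ordinal(index: int, cuda_visible_devices: Optional[str], physical_count: int) -> bool:
    """True if set_device(index) would be in range for this process."""
    return 0 <= index < visible_device_count(cuda_visible_devices, physical_count)


# A worker restricted to one device sees it as ordinal 0, whatever its physical ID.
print(is_valid_ordinal(0, "3", physical_count=8))   # in range
print(is_valid_ordinal(3, "3", physical_count=8))   # out of range
print(is_valid_ordinal(7, None, physical_count=8))  # in range on an 8-GPU node
```

One situation where this shows up is a worker whose visible-device list has been narrowed to a single GPU while the code still uses the node-wide index. Printing the visible-device environment variable and `torch.cuda.device_count()` inside a worker on the failing node (192.168.205.25 in the log) would show whether that is happening here.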