Launching pp8 tp2 with Ray produces the error below. What is the cause?

yangshanjun · November 26, 2025, 4:04am

(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 357, in __init__
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] self.worker.init_device()
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 666, in init_device
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] self.worker.init_device() # type: ignore
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 135, in init_device
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] torch.cuda.set_device(self.device)
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 478, in set_device
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] torch._C._cuda_setDevice(device)
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] RuntimeError: HIP error: invalid device ordinal
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] For debugging consider passing AMD_SERIALIZE_KERNEL=3
(VllmWorker rank=12 pid=42139) ERROR 11-26 12:03:59 [multiproc_executor.py:487] Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

RunLLM · November 26, 2025, 4:05am

The error "HIP error: invalid device ordinal" means that a worker tried to select a GPU index that does not exist or is not visible to its process. With pp8 and tp2 the job needs 8 × 2 = 16 GPUs in total, so the error usually comes from requesting more parallelism than the devices a node exposes, or from the device visibility variables (CUDA_VISIBLE_DEVICES, or on ROCm/AMD HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES) being unset or set inconsistently, so that device IDs are not mapped as the runtime expects. The traceback also passes through multiproc_executor.py, which suggests that the workers were started by the multiprocessing executor on a single node rather than placed across nodes by Ray; if so, rank 12 would likely have no local GPU to bind to on a node with fewer than 13 devices. On ROCm/AMD, check that every node exposes the expected GPUs and that the total matches pipeline_parallel_size × tensor_parallel_size. See similar issues and troubleshooting steps in this discussion and this one.
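As a rough illustration of the device-count mismatch described above, the following sketch compares the number of workers a pp/tp layout needs with the number of GPUs one node exposes. The helper names (required_world_size, ranks_without_device) and the figure of 8 GPUs per node are assumptions for illustration, not part of vLLM or of the original report, and the model that local rank N binds to device N is a simplification of what the executor actually does.

```python
def required_world_size(pipeline_parallel_size: int, tensor_parallel_size: int) -> int:
    """Total number of worker processes (one GPU each) for a pp x tp layout."""
    return pipeline_parallel_size * tensor_parallel_size


def ranks_without_device(world_size: int, visible_gpus: int) -> list[int]:
    """Ranks that would have no GPU if every worker ran on one node.

    Simplifying assumption: worker rank N tries to bind to device index N.
    """
    return list(range(visible_gpus, world_size))


if __name__ == "__main__":
    pp, tp = 8, 2            # layout from the question
    gpus_per_node = 8        # hypothetical; replace with the real count
    world_size = required_world_size(pp, tp)
    missing = ranks_without_device(world_size, gpus_per_node)
    print(f"pp{pp} tp{tp} needs {world_size} GPUs; one node exposes {gpus_per_node}")
    if missing:
        print(f"ranks with no local device on a single node: {missing}")
```

Under these assumptions rank 12 falls in the range with no local device, which would be consistent with the failing worker in the log. The actual GPU count can be read on each node with torch.cuda.device_count(), which also reflects any visibility variables set for that process.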

Would you like a step-by-step guide to check and fix your device configuration?

