Not sure which category to use for this question. We have been running vLLM up to v0.14.1 on a p5en.48xlarge in AWS on our EKS cluster, serving gpt-oss-120b. However, when we tried to roll out v0.15.0 and v0.15.1, startup failed. I found two open issues ([Bug]: Issue with vllm 0.15.0 image - running via docker · Issue #33447 · vllm-project/vllm · GitHub and [Bug]: Serving model in 0.15.0 Docker container hangs - 0.14.1 worked fine · Issue #33369 · vllm-project/vllm · GitHub) that look similar to what we are seeing, but the workaround in the second one did not help. We also upgraded our instance AMI to the latest available as of this morning, in case it was a driver issue, but that did not resolve it either.

Here is a snippet of the error we see on startup:
ERROR 01-30 12:09:33 [multiproc_executor.py:772] WorkerProc failed to start.
ERROR 01-30 12:09:33 [multiproc_executor.py:772] Traceback (most recent call last):
ERROR 01-30 12:09:33 [multiproc_executor.py:772] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 743, in worker_main
ERROR 01-30 12:09:33 [multiproc_executor.py:772] worker = WorkerProc(*args, **kwargs)
ERROR 01-30 12:09:33 [multiproc_executor.py:772] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 12:09:33 [multiproc_executor.py:772] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 569, in __init__
ERROR 01-30 12:09:33 [multiproc_executor.py:772] self.worker.init_device()
ERROR 01-30 12:09:33 [multiproc_executor.py:772] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
ERROR 01-30 12:09:33 [multiproc_executor.py:772] self.worker.init_device() # type: ignore
ERROR 01-30 12:09:33 [multiproc_executor.py:772] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 12:09:33 [multiproc_executor.py:772] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 210, in init_device
ERROR 01-30 12:09:33 [multiproc_executor.py:772] current_platform.set_device(self.device)
ERROR 01-30 12:09:33 [multiproc_executor.py:772] File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 123, in set_device
ERROR 01-30 12:09:33 [multiproc_executor.py:772] torch.cuda.set_device(device)
ERROR 01-30 12:09:33 [multiproc_executor.py:772] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 567, in set_device
ERROR 01-30 12:09:33 [multiproc_executor.py:772] torch._C._cuda_setDevice(device)
ERROR 01-30 12:09:33 [multiproc_executor.py:772] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 410, in _lazy_init
ERROR 01-30 12:09:33 [multiproc_executor.py:772] torch._C._cuda_init()
ERROR 01-30 12:09:33 [multiproc_executor.py:772] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination
Would anyone be able to give us some direction on how to resolve this? Our Docker image uses vllm/vllm-openai:v0.15.1 as its base.
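In case it helps with diagnosis: CUDA error 803 usually means the host driver is older than the minimum required by the CUDA runtime baked into the container. Here is a small sketch of the check we are running on the node. The minimum version shown (550.54.14, NVIDIA's documented floor for CUDA 12.4 on Linux) is an assumption for illustration; substitute whatever minimum matches the CUDA version actually shipped in the v0.15.1 image. The placeholder driver value is just so the snippet runs anywhere; on the node itself we feed in the real value from nvidia-smi.

```shell
#!/bin/sh
# Compare the node's NVIDIA driver version against a minimum required
# version, using version-aware sorting (sort -V).

# Returns success (0) if version $1 >= version $2.
driver_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Placeholder driver version so the check is runnable without a GPU.
# On the actual node, use:
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
driver="${1:-570.86.15}"

# ASSUMPTION: 550.54.14 is the floor for CUDA 12.4 on Linux; adjust to
# match the CUDA runtime in the vllm/vllm-openai:v0.15.1 image.
required="550.54.14"

if driver_at_least "$driver" "$required"; then
  echo "driver $driver satisfies minimum $required"
else
  echo "driver $driver is older than $required -- likely cause of error 803"
fi
```

Running this on the node and inside the container (the container sees the host driver via the NVIDIA runtime) should tell us whether the new image simply needs a newer driver than the AMI provides.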
Thank you!