vLLM v0.15.1 failing when deployed in AWS

Not sure which category to use for this question. We have been running vLLM up to v0.14.1 on a p5en.48xlarge in AWS on our EKS cluster, serving gpt-oss-120b. However, when we tried to deploy v0.15.0 and v0.15.1, we were unsuccessful. I found two open issues ([Bug]: Issue with vllm 0.15.0 image - running via docker · Issue #33447 · vllm-project/vllm · GitHub and [Bug]: Serving model in 0.15.0 Docker container hangs - 0.14.1 worked fine · Issue #33369 · vllm-project/vllm · GitHub) similar to what we are seeing, but the workaround in the second one did not help. We have also upgraded our instance AMI to the latest as of this morning, hoping it might be a driver issue, but that did not resolve things. Here is a snippet of the error we see on startup.

ERROR 01-30 12:09:33 [multiproc_executor.py:772] WorkerProc failed to start.
ERROR 01-30 12:09:33 [multiproc_executor.py:772] Traceback (most recent call last):
ERROR 01-30 12:09:33 [multiproc_executor.py:772]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 743, in worker_main
ERROR 01-30 12:09:33 [multiproc_executor.py:772]     worker = WorkerProc(*args, **kwargs)
ERROR 01-30 12:09:33 [multiproc_executor.py:772]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 12:09:33 [multiproc_executor.py:772]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 569, in __init__
ERROR 01-30 12:09:33 [multiproc_executor.py:772]     self.worker.init_device()
ERROR 01-30 12:09:33 [multiproc_executor.py:772]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
ERROR 01-30 12:09:33 [multiproc_executor.py:772]     self.worker.init_device()  # type: ignore
ERROR 01-30 12:09:33 [multiproc_executor.py:772]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 12:09:33 [multiproc_executor.py:772]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 210, in init_device
ERROR 01-30 12:09:33 [multiproc_executor.py:772]     current_platform.set_device(self.device)
ERROR 01-30 12:09:33 [multiproc_executor.py:772]   File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 123, in set_device
ERROR 01-30 12:09:33 [multiproc_executor.py:772]     torch.cuda.set_device(device)
ERROR 01-30 12:09:33 [multiproc_executor.py:772]   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 567, in set_device
ERROR 01-30 12:09:33 [multiproc_executor.py:772]     torch._C._cuda_setDevice(device)
ERROR 01-30 12:09:33 [multiproc_executor.py:772]   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 410, in _lazy_init
ERROR 01-30 12:09:33 [multiproc_executor.py:772]     torch._C._cuda_init()
ERROR 01-30 12:09:33 [multiproc_executor.py:772] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination

Would anyone be able to provide us with some direction on how we might resolve this issue? Our Docker image is using this base: vllm/vllm-openai:v0.15.1

Thank you!

Your error (Error 803: system has unsupported display driver / cuda driver combination) is a known issue with vLLM v0.15.x Docker images on newer NVIDIA drivers (e.g., Blackwell GPUs, CUDA 13.x). The root cause is a conflict between the container’s CUDA compatibility libraries and the host’s driver, which leads to the wrong libcuda.so being loaded. The workaround in issue #33369 and issue #33447 is to remove or override the /etc/ld.so.conf.d/00-cuda-compat.conf file in the container, but this does not always work with the latest images or drivers.
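For reference, that file-removal workaround can be baked into a derived image rather than applied at container start; a minimal Dockerfile sketch, assuming the base image used in this thread (whether this exact compat-config filename exists may vary between image tags):

```dockerfile
FROM vllm/vllm-openai:v0.15.1
# Drop the CUDA forward-compat loader config so the host's libcuda.so
# (injected by the NVIDIA container runtime) is resolved instead of the
# compat library shipped in the image, then rebuild the linker cache.
RUN rm -f /etc/ld.so.conf.d/00-cuda-compat.conf && ldconfig
```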

A more robust solution is to explicitly set the LD_LIBRARY_PATH in your Docker container to prioritize the host’s NVIDIA driver libraries and exclude the container’s CUDA toolkit paths. For example, set:

LD_LIBRARY_PATH: '/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu'

in your container environment, and ensure you do not mount /tmp/nvidia-mps if using multiprocessing with spawn (the new default in v0.15.x). This approach has been reported to resolve the error on Blackwell GPUs with 590+ drivers and CUDA 13.x. If this does not resolve your issue, you may need to wait for a backported fix in a future vLLM release.
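On EKS, that environment override would typically be set in the pod spec rather than in the image itself; a minimal sketch of the container entry (the image tag and LD_LIBRARY_PATH value come from this thread, while the container name, GPU count, and everything else are illustrative placeholders):

```yaml
# Hypothetical fragment of a Deployment/Pod spec for the vLLM container.
containers:
  - name: vllm
    image: vllm/vllm-openai:v0.15.1
    env:
      # Prefer the host driver libraries injected by the NVIDIA device
      # plugin over the CUDA compat paths baked into the image.
      - name: LD_LIBRARY_PATH
        value: "/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu"
    resources:
      limits:
        nvidia.com/gpu: 8
```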

Would you like a step-by-step breakdown or a sample Docker Compose snippet?


I have tried many combinations of LD_LIBRARY_PATH overrides and have not found a working path yet. Some more details:

AWS Instance Type: p5en.48xlarge
AMI Type: BOTTLEROCKET_x86_64_NVIDIA
AMI Version: 1.54.0-5043decc

Follow-up: the nightly build works fine (version 0.15.2rc1.dev), so this appears to have been addressed. We will use it until the next release.