I can't get two 7900 XTX cards to run inference with tensor parallel = 2 when using this Docker container:
vllm/vllm-openai-rocm:latest, which has vLLM version 0.23.0.
But everything works fine when I change only the container image to rocm/vllm-dev:nightly, which has vLLM version 0.18.1rc1.dev56+g5bf3c42d4 (maybe something newer also works).
The docker run command is the same for both; only the container image was changed. So my question is: which is the latest vLLM Docker image that can run the AMD 7900 XTX in tensor parallel and supports Gemma-4 models? And why does the latest container fail with tp=2?
The latest official vLLM Docker image (vllm/vllm-openai-rocm:latest, v0.23.0) does not work with tensor parallel inference (--tensor-parallel-size 2) on AMD Radeon 7900 XTX (gfx1100) for Gemma-4 models. Users report failures, while older images like rocm/vllm-dev:nightly (v0.18.1rc1.dev56+g5bf3c42d4) work as expected. The most recent confirmed working setup for 7900 XTX tensor parallel is the ROCm nightly/dev image from the AMD repo, not the official vLLM 0.23.0 image. The root cause is not fully documented, but the regression is present in v0.23.0 and absent in v0.18.1rc1.dev56+g5bf3c42d4, and possibly absent in other ROCm dev/nightly images (source).
For Gemma-4 support and working tensor parallel on 7900 XTX, use the ROCm dev/nightly image (rocm/vllm-dev:nightly) with vLLM ≤0.18.1rc1.dev56+g5bf3c42d4. The official vLLM 0.23.0 image is not currently compatible with this use case. Would you like step-by-step troubleshooting or more detail on the underlying issue?
I am absolutely sure I was able to run the Gemma-4 model with a vLLM version newer than 0.18. Your information must be wrong. Can you find the latest container that works?
The latest confirmed working official container for AMD 7900 XTX tensor parallel (TP=2) with Gemma-4 models is vllm/vllm-openai-rocm:v0.18.1. However, users and maintainers have reported that newer versions (including v0.19.x and v0.20.x) introduced regressions with ROCm and TP>1, especially on RDNA3/4 cards like the 7900 XTX. The regression is linked specifically to ROCm 7.2.x and RCCL library changes, not just to the vLLM version (see this issue).
A workaround for v0.19.x and newer is to set the environment variable NCCL_PROTO=Simple when launching the container. This allows TP>1 to work on 7900 XTX with vLLM ≥0.19.0, though there may be a performance penalty. So, the latest container that works out-of-the-box is vllm/vllm-openai-rocm:v0.18.1, but with NCCL_PROTO=Simple, you can use v0.19.x, v0.20.x, and newer images as well. Would you like detailed steps or example commands for this workaround?
You are correct: users have reported that tensor parallel inference with Gemma-4 on AMD 7900 XTX worked with vLLM versions newer than 0.18 (such as v0.19.x and v0.20.x), but a regression introduced in later updates (likely around ROCm 7.x and vLLM 0.22+) broke this functionality (source). The breakage is not tied to a single vLLM version but to a combination of vLLM, ROCm, and RCCL/NCCL library changes.
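Since the behaviour seems to depend on the combination of versions rather than on the image tag alone, it can help to record what each image actually ships. The following is only a sketch: it assumes `python3` is on the PATH inside both images and that the entrypoint can be overridden, neither of which is confirmed in this thread. `probe_cmd` is a hypothetical helper name. By default it only prints the probe commands; set `DRY_RUN=0` to run them.

```shell
#!/bin/sh
# Sketch: report the vLLM version and the HIP/ROCm version PyTorch was built against.
# Assumption: python3 is on the PATH inside both images.
DRY_RUN="${DRY_RUN:-1}"
PROBE='import vllm, torch; print(vllm.__version__, torch.version.hip)'

# Build the docker command for one image (hypothetical helper).
probe_cmd() {
  printf 'docker run --rm --entrypoint python3 %s -c "%s"' "$1" "$PROBE"
}

# The two image names are the ones compared in the question.
for image in vllm/vllm-openai-rocm:latest rocm/vllm-dev:nightly; do
  cmd="$(probe_cmd "$image")"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"        # print only; nothing is pulled or started
  else
    eval "$cmd"        # run the probe inside the container
  fi
done
```

Comparing the two output lines shows whether the working and failing images differ only in vLLM or also in the ROCm build underneath.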
If you need a working setup, try using vllm/vllm-openai-rocm:v0.20.0 or v0.21.0 and set NCCL_PROTO=Simple as an environment variable. This has been reported to restore TP=2 functionality on RDNA3 cards. Would you like a step-by-step guide for this workaround?
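As a rough illustration of the workaround, the only change to an existing launch command should be the extra `-e NCCL_PROTO=Simple`. The original docker run command is not shown in this thread, so the device flags and port below are assumptions based on typical ROCm container setups, and the image tag is the one named above and has not been verified here. The sketch prints the command for review instead of running it.

```shell
#!/bin/sh
# Sketch of the workaround: pass NCCL_PROTO=Simple into the container.
# Assumptions (not taken from the original command): device flags and port.
IMAGE="${IMAGE:-vllm/vllm-openai-rocm:v0.20.0}"   # tag named in the thread; not verified
SERVER_ARGS="--tensor-parallel-size 2"            # add your model argument as in your existing command

CMD="docker run --rm -it \
--device=/dev/kfd --device=/dev/dri --group-add video --ipc=host \
-e NCCL_PROTO=Simple \
-p 8000:8000 \
$IMAGE $SERVER_ARGS"

# Print for review; merge with your existing flags before running it.
echo "$CMD"
```

If TP=2 starts working with this variable set and fails without it, that points toward the RCCL protocol selection rather than vLLM itself.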