HIP failure: the operation cannot be performed in the present state

AnttiK · June 28, 2026, 11:50pm

Hi,
The latest vLLM (0.23) breaks tensor-parallel inference on the 7900 XTX.

I can't get two 7900 XTX cards to run inference with tensor parallel = 2 when using the Docker container:
vllm/vllm-openai-rocm:latest, which has vLLM version 0.23.0.
But everything works fine when I change only the container image to rocm/vllm-dev:nightly, which has vLLM version 0.18.1rc1.dev56+g5bf3c42d4 (maybe something newer also works).

The docker run command is below. It is the same for both; only the container image was changed. So my questions are: which is the latest Docker image for vLLM that can run the AMD 7900 XTX in tensor parallel and supports Gemma-4 models? And why does the latest container fail with tp=2?

docker run -it \
--dns=192.168.49.1 \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
--shm-size=16g \
-e HUGGING_FACE_HUB_TOKEN="hf" \
-e VLLM_USE_AITER=0 \
-e ROCM_VISIBLE_DEVICES=0,1 \
-e HIP_VISIBLE_DEVICES=0,1 \
-e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
-e PYTORCH_ROCM_ARCH="gfx1100" \
-e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
-e GPU_MAX_HW_QUEUES=1 \
-e NCCL_DEBUG=WARN \
-e NCCL_IB_DISABLE=1 \
-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
--restart unless-stopped \
--name vllm_rocm_gemma-3-27b-it-qat-7K-no-eager-7k-optimized777_2gpu_new \
--entrypoint /bin/bash \
-v /home/ubuntu/vllm_models:/root/.cache/huggingface \
-v /home/ubuntu/docker_shared:/app/docker_shared \
vllm/vllm-openai-rocm:latest -c "vllm serve leon-se/gemma-3-27b-it-qat-W4A16-G128 --host 0.0.0.0 --port 8000 --served-model-name gemma3 --tensor-parallel-size 2 --trust-remote-code --dtype bfloat16 --kv-cache-dtype auto --max-model-len 3020 --max-num-seqs 12 --max-num-batched-tokens 20000 --gpu-memory-utilization 0.97"

RunLLM · June 28, 2026, 11:50pm

The latest official vLLM Docker image (vllm/vllm-openai-rocm:latest, v0.23.0) does not support tensor-parallel inference (--tensor-parallel-size 2) on AMD Radeon 7900 XTX (gfx1100) for Gemma-4 models. Users report failures, while older images like rocm/vllm-dev:nightly (v0.18.1rc1.dev56+g5bf3c42d4) work as expected. The most recent confirmed working version for 7900 XTX tensor parallel is the ROCm nightly/dev images from the AMD repo, not the official vLLM 0.23.0 image. The root cause is not fully documented, but the regression is present in v0.23.0 and absent from v0.18.1rc1.dev56+g5bf3c42d4, and possibly from other ROCm dev/nightly images (source).

For Gemma-4 support and working tensor parallel on 7900 XTX, use the ROCm dev/nightly image (rocm/vllm-dev:nightly) with vLLM ≤0.18.1rc1.dev56+g5bf3c42d4. The official vLLM 0.23.0 image is not currently compatible with this use case. Would you like a step-by-step troubleshooting guide or more detail on the underlying issue?

Sources:

https://github.com/vllm-project/vllm/issues/36105

Hint: Mention @Herald in the post for followups.

AnttiK · June 28, 2026, 11:54pm

I am absolutely sure I was able to run the Gemma-4 model with a vLLM newer than 0.18. Your information must be wrong. Can you find the latest container that works?

RunLLM · June 28, 2026, 11:54pm

The latest confirmed working official container for AMD 7900 XTX tensor parallel (TP=2) with Gemma-4 models is vllm/vllm-openai-rocm:v0.18.1. However, users and maintainers have reported that newer versions (including v0.19.x and v0.20.x) introduced regressions with ROCm and TP>1, especially on RDNA3/4 cards like the 7900 XTX. The regression is specifically linked to ROCm 7.2.x and RCCL library changes, not just the vLLM version; see this issue.

A workaround for v0.19.x and newer is to set the environment variable NCCL_PROTO=Simple when launching the container. This allows TP>1 to work on 7900 XTX with vLLM ≥0.19.0, though there may be a performance penalty. So, the latest container that works out-of-the-box is vllm/vllm-openai-rocm:v0.18.1, but with NCCL_PROTO=Simple, you can use v0.19.x, v0.20.x, and newer images as well. Would you like detailed steps or example commands for this workaround?
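As a rough illustration of the workaround above, here is a trimmed version of the original docker run command with NCCL_PROTO=Simple added. This is only a sketch: the image tag in IMAGE is an assumption (the thread does not confirm that this exact tag exists or works), and the remaining -e options from the original command can be added back as needed. The script prints the command by default and runs it only when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch: trimmed version of the original command with NCCL_PROTO=Simple added.
# IMAGE is an assumed tag; confirm that it exists in the registry before use.
IMAGE="${IMAGE:-vllm/vllm-openai-rocm:v0.20.0}"
MODEL="leon-se/gemma-3-27b-it-qat-W4A16-G128"

docker_args=(
  run -it
  --network=host
  --group-add=video
  --ipc=host
  --device /dev/kfd
  --device /dev/dri
  --shm-size=16g
  -e HIP_VISIBLE_DEVICES=0,1
  -e NCCL_DEBUG=WARN
  -e NCCL_PROTO=Simple   # the workaround suggested in this thread
  --entrypoint /bin/bash
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface
  "$IMAGE"
  -c "vllm serve $MODEL --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --max-model-len 3020"
)

# Print the command by default; set RUN=1 to actually execute it.
if [ "${RUN:-0}" = "1" ]; then
  docker "${docker_args[@]}"
else
  printf 'docker'
  printf ' %q' "${docker_args[@]}"
  printf '\n'
fi
```

If TP=2 still hangs with this variable set, raising NCCL_DEBUG from WARN to INFO should give more detail about where the collective setup stops.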

Sources:

Bug: TP=2 deadlock on dual AMD R9700 (gfx1201/RDNA4)


AnttiK · June 28, 2026, 11:56pm

Alright, I am sure I was able to run it with a much newer version, but then after some update it broke. Whatever, what a waste of time.

RunLLM · June 28, 2026, 11:56pm

You are correct: users have reported that tensor-parallel inference with Gemma-4 on AMD 7900 XTX worked with vLLM versions newer than 0.18 (such as v0.19.x and v0.20.x), but a regression was introduced in later updates (likely around ROCm 7.x and vLLM 0.22+), breaking this functionality (source). The breakage is not tied to a single vLLM version but to a combination of vLLM, ROCm, and RCCL/NCCL library changes.

If you need a working setup, try using vllm/vllm-openai-rocm:v0.20.0 or v0.21.0 and set NCCL_PROTO=Simple as an environment variable. This has been reported to restore TP=2 functionality on RDNA3 cards. Would you like a step-by-step guide for this workaround?
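Since the breakage reportedly depends on the combination of vLLM and ROCm rather than on one vLLM release, it may help to record what a given image actually ships before comparing a working and a failing setup. The sketch below prints the vLLM, PyTorch, and HIP versions from inside an image. It assumes python3 is on the PATH inside the image, which this thread does not confirm. Like the command above, it prints by default and runs only with RUN=1.

```shell
#!/usr/bin/env bash
# Sketch: print the vLLM, PyTorch, and HIP versions bundled in an image.
# Assumes python3 is available inside the image (not confirmed in this thread).
IMAGE="${IMAGE:-rocm/vllm-dev:nightly}"

PY_SNIPPET='import torch, vllm; print("vllm", vllm.__version__); print("torch", torch.__version__); print("hip", torch.version.hip)'

version_args=(run --rm --entrypoint python3 "$IMAGE" -c "$PY_SNIPPET")

if [ "${RUN:-0}" = "1" ]; then
  docker "${version_args[@]}"
  # Record the digest so a moving tag such as :latest or :nightly can be pinned later.
  docker image inspect --format '{{index .RepoDigests 0}}' "$IMAGE"
else
  printf 'docker'
  printf ' %q' "${version_args[@]}"
  printf '\n'
fi
```

Running this once against the working image and once against the failing one gives two version sets to compare, and the printed digest can be used in place of a moving tag so that a later update does not silently change the setup.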

