Running QAT model with ROCm

Hi,

Running vLLM on Ubuntu 24.04 with ROCm 7.1.1 and a 7900 XTX:

docker run -it --dns=192.168.49.1 --network=host --group-add=video --ipc=host \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device /dev/kfd --device /dev/dri --shm-size=16g \
  -e HUGGING_FACE_HUB_TOKEN="" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e ROCM_VISIBLE_DEVICES=0,1 \
  -e HIP_VISIBLE_DEVICES=0,1 \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -e PYTORCH_ROCM_ARCH="gfx1100" \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e GPU_MAX_HW_QUEUES=1 \
  -e NCCL_DEBUG=WARN \
  -e NCCL_IB_DISABLE=1 \
  --restart unless-stopped \
  --name vllm_rocm_gemma-3-27b-it-qat-W4A16-G128_tp1 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006 \
  vllm serve leon-se/gemma-3-27b-it-qat-W4A16-G128 \
  --host 0.0.0.0 --port 8000 --enforce-eager \
  --served-model-name vllm/gemma-3 --tensor-parallel-size 1 \
  --trust-remote-code --dtype bfloat16 --kv-cache-dtype auto \
  --max-model-len 4096 --max-num-seqs 4 --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.92 --swap-space 12 \
  --disable-log-requests --disable-log-stats --max-log-len 100

During an inference benchmark it's very slow and prints these warnings:

Is this model compatible with the 7900 XTX?

(EngineCore_DP0 pid=127) [rank0]:W1128 23:47:07.358000 127 torch/_dynamo/convert_frame.py:1097] [0/8] torch._dynamo hit config.recompile_limit (8)
(EngineCore_DP0 pid=127) [rank0]:W1128 23:47:07.358000 127 torch/_dynamo/convert_frame.py:1097] [0/8] function: 'forward_static' (/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/layernorm.py:372)
(EngineCore_DP0 pid=127) [rank0]:W1128 23:47:07.358000 127 torch/_dynamo/convert_frame.py:1097] [0/8] last reason: 0/7: expected type of 'residual' to be a tensor type, ' but found <class 'NoneType'>
(EngineCore_DP0 pid=127) [rank0]:W1128 23:47:07.358000 127 torch/_dynamo/convert_frame.py:1097] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
(EngineCore_DP0 pid=127) [rank0]:W1128 23:47:07.358000 127 torch/_dynamo/convert_frame.py:1097] [0/8] To diagnose recompilation issues, see torch.compile Troubleshooting — PyTorch main documentation
(APIServer pid=1) INFO: 127.0.0.1:48138 - “POST /v1/completions HTTP/1.1” 200 OK

The Gemma-3-27B QAT W4A16 model is not officially listed as supported on the AMD Radeon RX 7900 XTX (gfx1100) in the vLLM ROCm documentation. vLLM does support ROCm 6.3+ and the 7900 series, but large models (especially quantized or QAT variants) are primarily validated on the Instinct MI200/MI300 series. Slow inference and repeated "torch._dynamo hit config.recompile_limit" warnings are common on consumer GPUs due to limited support for advanced attention backends and kernel-recompilation overhead. The warnings you see indicate excessive recompilation and possible incompatibility or suboptimal performance on this hardware, as also reported in similar vLLM issues and the PyTorch troubleshooting docs.
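To illustrate what that warning means, here is a simplified pure-Python analogy (invented for illustration, not actual torch internals): torch.compile installs guards such as "is `residual` a tensor or None", every guard miss triggers a recompile, and once the count passes `recompile_limit` (default 8) dynamo stops compiling and falls back to slow eager execution.

```python
RECOMPILE_LIMIT = 8  # analogous to torch._dynamo config.recompile_limit

class CompiledFn:
    """Toy stand-in for a dynamo-compiled function (analogy only)."""
    def __init__(self, fn, limit=RECOMPILE_LIMIT):
        self.fn = fn
        self.limit = limit
        self.guard = None          # last-seen guard value
        self.recompiles = 0
        self.eager_fallback = False

    def __call__(self, x, residual=None):
        key = residual is not None  # the guard: is `residual` a tensor or None?
        if key != self.guard:       # guard miss -> "recompile"
            self.recompiles += 1
            self.guard = key
            if self.recompiles > self.limit:
                self.eager_fallback = True  # give up and run eagerly (slow)
        return self.fn(x, residual)

# A call pattern that alternates None / non-None for `residual`
# blows past the limit and forces the permanent eager fallback:
f = CompiledFn(lambda x, r: x + (r or 0))
for i in range(20):
    f(1, None if i % 2 else 1)
```

The real guard in the log is exactly this flip-flop: `forward_static` sometimes receives `residual=None` and sometimes a tensor, so each variant invalidates the previous compilation.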

For best performance and compatibility, use Instinct MI200/MI300 series GPUs. On a 7900 XTX you may experience slowdowns and limited support for some models and features. You can try a smaller model, a different model, or setting TORCH_LOGS="recompiles" for more diagnostics, but full compatibility for this model on your GPU is not guaranteed according to the official vLLM ROCm docs.
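As a sketch, the diagnostic flag can be added to the existing container invocation like this (the `...` stands for the unchanged flags and arguments from the docker run command above):

```shell
# Re-run the same container with recompile logging enabled;
# "..." = the unchanged flags/arguments from the original docker run.
docker run -it ... \
  -e TORCH_LOGS=recompiles \
  ... \
  rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006 \
  vllm serve leon-se/gemma-3-27b-it-qat-W4A16-G128 ...
```

The engine log will then print every guard that failed, not just the last one, which shows whether the recompiles all come from the `residual` None/tensor guard or from other sources such as varying shapes.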

Would you like more detail on troubleshooting or alternative models for your hardware?


Hint: Mention @RunLLM in the post for followups.

Okay, so which Gemma 27B model can I run best with 2x 7900 XTX, which means 48 GB of VRAM? @RunLLM
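For context on the 48 GB budget, a rough back-of-envelope sketch of the memory math (assumptions: 4-bit weights ≈ 0.5 bytes/param, 2-byte KV-cache elements; ignores activations, quantization scales, and framework overhead; the helper names are invented for illustration and the example arguments are placeholders, not Gemma-3's actual config):

```python
def weights_gb(n_params, bits_per_weight):
    """Approximate weight memory in GB (ignores quantization scales)."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size in GB: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

print(weights_gb(27e9, 4))   # ~13.5 GB of weights for 27B at W4A16
print(weights_gb(27e9, 16))  # ~54 GB at bf16 -- would not fit in 48 GB
```

So a W4A16 (or similar 4-bit) 27B checkpoint leaves ample room for KV cache across two 24 GB cards, whereas an unquantized bf16 27B model does not fit at all.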