Trying to run gemma-3-27b-it-FP8-dynamic with ROCm

File "/root/.cache/vllm/torch_compile_cache/673bdabe01/rank_0_0/inductor_cache/pc/cpcqngxnt2x5gu45zk774uoirkal6qycvskya6cnvcodhr7ptwsq.py", line 2203, in call
(VllmWorker rank=0 pid=16355) ERROR 07-25 17:02:53 [multiproc_executor.py:522] extern_kernels._scaled_mm(buf0, arg4_1, buf2, reinterpret_tensor(arg3_1, (1, 4096), (1, 1), 0), out_dtype=torch.bfloat16, use_fast_accum=False, out=buf7)
(VllmWorker rank=0 pid=16355) ERROR 07-25 17:02:53 [multiproc_executor.py:522] RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+
(VllmWorker rank=0 pid=16355) ERROR 07-25 17:02:53 [multiproc_executor.py:522]
ERROR 07-25 17:02:53 [core.py:519] EngineCore failed to start.
ERROR 07-25 17:02:53 [core.py:519] Traceback (most recent call last):
ERROR 07-25 17:02:53 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
ERROR 07-25 17:02:53 [core.py:519] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-25 17:02:53 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The error "torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+" means your GPU does not meet the minimum hardware requirement for FP8 operations in vLLM. Only NVIDIA GPUs with compute capability 8.9 (Ada Lovelace) or 9.0+ (Hopper and newer) and AMD MI300-series accelerators support these kernels. GPUs with lower compute capability (e.g., Ampere, Turing, or older) will trigger this error. There is no workaround in vLLM for running FP8 models on unsupported hardware; you must use a supported GPU or switch to a model/quantization that does not require FP8 kernels.
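If you want to confirm what PyTorch reports for your GPUs, here is a quick check (a sketch assuming a ROCm build of PyTorch; a 7900 XTX reports architecture gfx1100, while MI300-series accelerators report gfx942):

# prints device name, ROCm architecture (if available), and compute capability for GPU 0
python -c "import torch; p = torch.cuda.get_device_properties(0); print(p.name, getattr(p, 'gcnArchName', None), torch.cuda.get_device_capability(0))"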

For more details, see the discussion in vllm issue #10663.

Would you like a step-by-step explanation or hardware compatibility details?

so can I run this model at all with a 2x 7900 XTX setup? how do I change this command:

vllm serve /workspace/models/gemma-3-27b-it-FP8-dynamic/ \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --dtype auto \
  --max-model-len 4096 \
  --max-num-seqs 2 \
  --port 8000 \
  --host 0.0.0.0

No, you cannot run FP8 models like gemma-3-27b-it-FP8-dynamic on 2x 7900 XTX, because FP8 (torch._scaled_mm) is only supported on AMD MI300+ or NVIDIA Hopper/Ada GPUs. The 7900 XTX (RDNA3) is not supported for FP8 inference in vLLM, and you will hit the same error as before. No command-line change will enable FP8 inference on this hardware; use a non-FP8 version of the model (e.g., bfloat16 or INT4) instead.
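If you want to verify that a local checkpoint really is pre-quantized to FP8, one way (a sketch; the exact keys depend on the tool that produced the checkpoint) is to inspect the quantization_config block in its config.json:

# prints the checkpoint's quantization_config, or a note if it has none
python -c "import json; cfg = json.load(open('/workspace/models/gemma-3-27b-it-FP8-dynamic/config.json')); print(json.dumps(cfg.get('quantization_config', 'no quantization_config found'), indent=2))"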

See vllm issue #10663 and FP8 quantization docs for details.

Would you like guidance on running a different quantization or model on your GPUs?

can you suggest which Gemma3 model I could run?

This is the only one I can run: gemma-3n-E4B-it
But it's too small.

It seems likely that on your 2x 7900 XTX setup you are limited to models that fit within your combined VRAM and do not require FP8 support. The Gemma3 4B and 12B models should be feasible if you use an unquantized bfloat16 checkpoint or an INT4/INT8-quantized variant, but the 27B model is likely too large for two GPUs unless you use aggressive quantization and a small context window. FP8 and some other quantized formats are not supported on your hardware, as discussed in vllm issue #10663.

My understanding is that you should try gemma-3-4b-it or gemma-3-12b-it in bfloat16 (or an INT4/INT8-quantized variant), with reduced --max-model-len and --max-num-seqs to fit within your VRAM. Avoid any model or quantization labeled FP8 or requiring torch._scaled_mm. See supported models and conserving memory for more.
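As a rough back-of-envelope (weights only, ignoring KV cache and activations, assuming the published parameter counts):

python -c "print('12B bf16: ~%.0f GiB of weights, ~%.0f GiB per GPU at TP=2' % (12e9*2/2**30, 12e9*2/2**30/2))"
python -c "print('27B bf16: ~%.0f GiB of weights, ~%.0f GiB per GPU at TP=2' % (27e9*2/2**30, 27e9*2/2**30/2))"

That puts the 12B weights at about 11 GiB per GPU with --tensor-parallel-size 2, while the 27B weights alone would need roughly 25 GiB per card, already more than a 24 GB 7900 XTX can hold before any KV cache is allocated.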

Would you like a sample command for running gemma-3-4b-it or 12b-it on your setup?

yes give me sample code

You can try running Gemma3 4B in bfloat16 on your 2x 7900 XTX with the following command, which avoids FP8 and uses a small context window to fit VRAM:

vllm serve /workspace/models/gemma-3-4b-it \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 2048 \
  --max-num-seqs 2 \
  --port 8000 \
  --host 0.0.0.0

If you encounter OOM, try reducing --max-model-len further. Do not use --kv-cache-dtype fp8 or any FP8 model. For more, see conserving memory.
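If it still does not fit, one lower-memory variant (a sketch using the same assumed local path; --gpu-memory-utilization caps the fraction of each GPU that vLLM reserves, default 0.9) would be:

# assumes a local bf16 checkpoint at this path
vllm serve /workspace/models/gemma-3-4b-it \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 1024 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0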

Would you like a sample for int4 quantization or for the 12B model?

Will that support image input?

Yes, gemma-3-4b-it supports image input in vLLM, but with limitations. Both V0 and V1 engines support text-only inputs; for text+image, V0 implements the correct attention pattern, while V1 uses a simplified pattern that may reduce output quality, especially for complex image tasks. Image input is not supported on TPU, and you must avoid FP8 quantization. See supported models and multimodal input details.

Would you like step-by-step instructions for sending image input to gemma-3-4b-it?

I need a Gemma3 model which I can serve through the OpenAI API and send images to from my app. The app already knows how to send images, I guess.

You can use gemma-3-4b-it with vLLM and serve it through the OpenAI-compatible API with image input support. Make sure to use a non-FP8 checkpoint (e.g., bfloat16) and follow the OpenAI Vision API format for sending images (as your app already does). See the multimodal input docs for details.
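A minimal request sketch against the OpenAI-compatible endpoint is shown below; the image URL is a placeholder, and the model name must match what the server reports (by default, the path passed to vllm serve):

# placeholder image URL; "model" defaults to the path given to vllm serve
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/workspace/models/gemma-3-4b-it",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}}
          ]
        }],
        "max_tokens": 128
      }'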

Would you like a full example of launching the server and a sample API request?

What if I add 2 more 7900 XTX cards, can I then run the 27B model?

It seems likely that adding 2 more 7900 XTX cards (4 GPUs total) would allow you to run the Gemma3 27B model, provided you use tensor parallelism (--tensor-parallel-size 4) and possibly reduce --max-model-len and --max-num-seqs to fit within the available VRAM. However, you must still avoid FP8 quantization, as RDNA3 GPUs do not support it; use a bfloat16 or INT4/INT8-quantized checkpoint instead. See conserving memory and parallelism strategies.
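A possible launch command for that 4-GPU setup is sketched below. The model path is illustrative: it assumes you download an unquantized bfloat16 27B checkpoint rather than reusing your existing FP8 directory.

# assumes a local bf16 27B checkpoint at this path (not the FP8 directory)
vllm serve /workspace/models/gemma-3-27b-it \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 2048 \
  --max-num-seqs 2 \
  --port 8000 \
  --host 0.0.0.0

At --tensor-parallel-size 4 the roughly 50 GiB of bf16 weights work out to about 13 GiB per card, leaving room for the KV cache on 24 GB GPUs.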

Would you like a sample launch command for 4 GPUs and tips for memory tuning?

I am running now
vllm serve /workspace/models/gemma-3-12b-it/ \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --dtype bfloat16 \
  --max-model-len 2048 \
  --max-num-seqs 2 \
  --port 8000 \
  --host 0.0.0.0

And with Open WebUI it's all messy and the chat makes no sense

If gemma-3-12b-it outputs messy or nonsensical text in Open WebUI, the most common cause is using an unsupported or unstable dtype (like float16) or a hardware/quantization mismatch. For Gemma3, float16 is known to cause numerical instability; bfloat16 or float32 is recommended. However, if you are already using bfloat16 and still see issues, it may be due to hardware compatibility, quantization, or a bug in vLLM or the model version. See issue #18733 and issue #15097.

Try switching to float32 (--dtype float32) if VRAM allows, and ensure you are not using FP8 or float16. If the problem persists, test with a smaller model (e.g., gemma-3-4b-it) to isolate the issue. Also, confirm your vLLM and transformers versions are up to date.
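A minimal isolation test along those lines (a sketch; it assumes you also have the 4B checkpoint downloaded to the path shown) is to run the smaller model in float32, which fits comfortably in your VRAM and rules out bfloat16-specific numerical issues:

# assumes a local gemma-3-4b-it checkpoint at this path
vllm serve /workspace/models/gemma-3-4b-it \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --dtype float32 \
  --max-model-len 2048 \
  --max-num-seqs 2 \
  --port 8000 \
  --host 0.0.0.0

If the 4B model produces clean output here, the problem is more likely specific to the 12B run (checkpoint, dtype, or a vLLM bug) than to your Open WebUI setup.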

Would you like step-by-step troubleshooting or a sample float32 command?

how to confirm versions are up to date

To confirm your vLLM and transformers versions are up to date, run the following commands in your environment:

pip show vllm
pip show transformers

Compare the output versions to the latest releases on vLLM PyPI and transformers PyPI. For best compatibility with new models, vLLM 0.9.2+ and transformers 4.53.2+ are recommended.
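An equivalent check from inside the same Python environment, which simply prints the installed package versions, is:

python -c "import vllm, transformers; print('vllm', vllm.__version__); print('transformers', transformers.__version__)"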

Would you like instructions for upgrading these packages?

how can I upgrade my transformers?

I have now:
Name: transformers
Version: 4.53.0

should be 4.53.2 at least

To upgrade transformers to at least 4.53.2, run:

pip install --upgrade transformers==4.53.2

This will install the required version. For more, see vllm issue #21034.
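Afterwards, you can re-check the installed version to confirm the upgrade took effect:

pip show transformers | grep -i "^version"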

Would you like instructions for upgrading vLLM as well?
