How to run GGUF with vLLM and ROCm

Getting: Value error, torch.bfloat16 is not supported for quantization method gguf. Supported dtypes: [torch.float16, torch.float32] [type=value_error, input_value=ArgsKwargs((), {'model_co…transfer_config': None}), input_type=ArgsKwargs]

docker run -it --dns=192.168.49.1 --network=host --group-add=video --ipc=host \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device /dev/kfd --device /dev/dri --shm-size=16g \
  -e HUGGING_FACE_HUB_TOKEN="" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e ROCM_VISIBLE_DEVICES=0,1 -e HIP_VISIBLE_DEVICES=0,1 \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 -e PYTORCH_ROCM_ARCH="gfx1100" \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 -e GPU_MAX_HW_QUEUES=1 \
  -e NCCL_DEBUG=WARN -e NCCL_IB_DISABLE=1 \
  --restart unless-stopped --name vllm_rocm_gemma-3-27b-it-qat-q4_0-gguf2 \
  -v /home/ubuntu/vllm_models:/models \
  rocm/vllm-dev:nightly \
  vllm serve /models/gemma-3-27b-it-Q5_K_M.gguf \
  --host 0.0.0.0 --port 8000 --enforce-eager --served-model-name gemma3 \
  --tensor-parallel-size 2 --trust-remote-code --dtype auto --kv-cache-dtype auto \
  --max-model-len 1024 --max-num-seqs 12 --max-num-batched-tokens 4000 \
  --gpu-memory-utilization 0.90 --swap-space 12 \
  --disable-log-requests --disable-log-stats --max-log-len 100

The error occurs because torch.bfloat16 is not supported for GGUF quantized models in vLLM; only torch.float16 and torch.float32 are allowed. You must explicitly set --dtype float16 or --dtype float32 when serving GGUF models to avoid this error, because --dtype auto may resolve to an unsupported dtype depending on your hardware and the model config.

According to the vllm.model_executor.layers.quantization.gguf documentation, GGUF quantization supports only float16 and float32 activation dtypes. Update your command to include --dtype float16 or --dtype float32.
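vLLM's real check lives inside its GGUF quantization config; the constraint behind the error can be sketched in a few lines of dependency-free Python. The names below (GGUF_SUPPORTED_DTYPES, check_gguf_dtype) are illustrative, not vLLM's actual identifiers:

```python
# Sketch of the dtype gate that produces the error above, under the
# assumption (from the error text) that GGUF accepts only fp16/fp32.
GGUF_SUPPORTED_DTYPES = ("float16", "float32")

def check_gguf_dtype(dtype: str) -> str:
    """Raise a ValueError like vLLM's when the requested dtype
    (e.g. bfloat16 picked by --dtype auto) is not GGUF-compatible."""
    if dtype not in GGUF_SUPPORTED_DTYPES:
        raise ValueError(
            f"torch.{dtype} is not supported for quantization method gguf. "
            "Supported dtypes: [torch.float16, torch.float32]"
        )
    return dtype
```

With --dtype auto, vLLM resolves the dtype from the model config (bfloat16 for Gemma 3), which is what trips this check; passing --dtype float16 or --dtype float32 explicitly avoids it.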

Would you like more detail or a sample command?

Hint: Mention @RunLLM in the post for followups.

Value error, The model type 'gemma3_text' does not support float16. Reason: Numerical instability. Please use bfloat16 or float32 instead. [type=value_error, input_value=ArgsKwargs((), {'model': …rocessor_plugin': None}), input_type=ArgsKwargs]

@RunLLM

The error means the model type gemma3_text is known to be numerically unstable with float16; you must use bfloat16 or float32 as the dtype. This is enforced in vLLM for Gemma3 models due to precision issues with float16, as documented in the code and issues (see _FLOAT16_NOT_SUPPORTED_MODELS).

Set --dtype bfloat16 or --dtype float32 when serving or using this model to resolve the error.

Would you like more detail or example commands?

So it looks like it's not possible to run GGUF Gemma 3 models with ROCm on a 7900 XTX: GGUF only allows float16 or float32, and Gemma 3 rejects float16, which leaves float32 as the only dtype that passes both checks.