How to run GGUF with vLLM and ROCm

Getting: Value error, torch.bfloat16 is not supported for quantization method gguf. Supported dtypes: [torch.float16, torch.float32] [type=value_error, input_value=ArgsKwargs((), {'model_co…transfer_config': None}), input_type=ArgsKwargs]

docker run -it --dns=192.168.49.1 --network=host --group-add=video --ipc=host \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device /dev/kfd --device /dev/dri --shm-size=16g \
  -e HUGGING_FACE_HUB_TOKEN="" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e ROCM_VISIBLE_DEVICES=0,1 -e HIP_VISIBLE_DEVICES=0,1 \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 -e PYTORCH_ROCM_ARCH="gfx1100" \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 -e GPU_MAX_HW_QUEUES=1 \
  -e NCCL_DEBUG=WARN -e NCCL_IB_DISABLE=1 \
  --restart unless-stopped --name vllm_rocm_gemma-3-27b-it-qat-q4_0-gguf2 \
  -v /home/ubuntu/vllm_models:/models \
  rocm/vllm-dev:nightly \
  vllm serve /models/gemma-3-27b-it-Q5_K_M.gguf \
  --host 0.0.0.0 --port 8000 --enforce-eager --served-model-name gemma3 \
  --tensor-parallel-size 2 --trust-remote-code --dtype auto --kv-cache-dtype auto \
  --max-model-len 1024 --max-num-seqs 12 --max-num-batched-tokens 4000 \
  --gpu-memory-utilization 0.90 --swap-space 12 \
  --disable-log-requests --disable-log-stats --max-log-len 100

The error occurs because torch.bfloat16 is not supported for GGUF quantized models in vLLM; only torch.float16 and torch.float32 are allowed. You must explicitly set --dtype float16 or --dtype float32 when serving GGUF models to avoid this error, because --dtype auto may resolve to an unsupported dtype depending on your hardware and the model config.

According to the vllm.model_executor.layers.quantization.gguf documentation, GGUF quantization supports only float16 and float32 activation dtypes. Update your command to include --dtype float16 or --dtype float32.
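vLLM's real check lives inside its GGUF quantization config; the constraint behind the error can be sketched in a few lines of dependency-free Python. The names below (GGUF_SUPPORTED_DTYPES, check_gguf_dtype) are illustrative, not vLLM's actual identifiers:

```python
# Sketch of the dtype gate that produces the error above, under the
# assumption (from the error text) that GGUF accepts only fp16/fp32.
GGUF_SUPPORTED_DTYPES = ("float16", "float32")

def check_gguf_dtype(dtype: str) -> str:
    """Raise a ValueError like vLLM's when the requested dtype
    (e.g. bfloat16 picked by --dtype auto) is not GGUF-compatible."""
    if dtype not in GGUF_SUPPORTED_DTYPES:
        raise ValueError(
            f"torch.{dtype} is not supported for quantization method gguf. "
            "Supported dtypes: [torch.float16, torch.float32]"
        )
    return dtype
```

With --dtype auto, vLLM resolves the dtype from the model config (bfloat16 for Gemma 3), which is what trips this check; passing --dtype float16 or --dtype float32 explicitly avoids it.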

Would you like more detail or a sample command?

Hint: Mention @RunLLM in the post for followups.

Value error, The model type 'gemma3_text' does not support float16. Reason: Numerical instability. Please use bfloat16 or float32 instead. [type=value_error, input_value=ArgsKwargs((), {'model': …rocessor_plugin': None}), input_type=ArgsKwargs]

@RunLLM

The error means the model type gemma3_text is known to be numerically unstable with float16; you must use bfloat16 or float32 as the dtype. This is enforced in vLLM for Gemma3 models due to precision issues with float16, as documented in the code and issues (see _FLOAT16_NOT_SUPPORTED_MODELS).

Set --dtype bfloat16 or --dtype float32 when serving or using this model to resolve the error.

Would you like more detail or example commands?

So it looks like it's not possible to run GGUF Gemma 3 models with ROCm on a 7900 XTX: GGUF only allows float16 or float32, and Gemma 3 rejects float16, which leaves float32 as the only dtype that passes both checks.