Why is the latest ROCm vLLM so bad?

What is wrong with this command? It uses the latest vLLM ROCm Docker image:

docker run -it --dns=192.168.49.1 --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri --shm-size=16g -e HUGGING_FACE_HUB_TOKEN=“” -e VLLM_SLEEP_WHEN_IDLE=1 -e ROCM_VISIBLE_DEVICES=0 -e HIP_VISIBLE_DEVICES=0 -e HSA_OVERRIDE_GFX_VERSION=11.0.0 -e PYTORCH_ROCM_ARCH=“gfx1100” -e VLLM_USE_TRITON_FLASH_ATTN=0 -e GPU_MAX_HW_QUEUES=1 -e NCCL_DEBUG=WARN -e NCCL_IB_DISABLE=1 --shm-size=10g --name vllm_rocm_gemma3 -v /home/ubuntu/vllm_models:/root/.cache/huggingface rocm/v llm:latest vllm serve google/gemma-3-27b-it-qat-q4_0-gguf --host 0.0.0.0 --port 8000 --enforce-eager --served-model-name gemma3 --trust-remote-code --dtype auto --kv-cache-dtype auto --max-model-len 2048 --max-num-seqs 2 --max-num-batched-tokens 2048 --gpu-memory-utilization 0.93 --swap-space 8 --disable-log-requests --disable-log-stats --max-log-len 100
[aiter] import [module_aiter_enum] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_aiter_enum.so
(APIServer pid=1) INFO 12-14 09:24:54 [api_server.py:1338] vLLM API server version 0.11.2.dev673+g839868462
(APIServer pid=1) INFO 12-14 09:24:54 [utils.py:253] non-default args: {'model_tag': 'google/gemma-3-27b-it-qat-q4_0-gguf', 'host': '0.0.0.0', 'max_log_len': 100, 'model': 'google/gemma-3-27b-it-qat-q4_0-gguf', 'trust_remote_code': True, 'max_model_len': 2048, 'enforce_eager': True, 'served_model_name': ['gemma3'], 'gpu_memory_utilization': 0.93, 'swap_space': 8.0, 'max_num_batched_tokens': 2048, 'max_num_seqs': 2, 'disable_log_stats': True}
(APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1385, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1404, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 172, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 198, in build_async_engine_client_from_engine_args
(APIServer pid=1) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1371, in create_engine_config
(APIServer pid=1) model_config = self.create_model_config()
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1227, in create_model_config
(APIServer pid=1) return ModelConfig(
(APIServer pid=1) ^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=1) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1) Value error, Invalid repository ID or local directory specified: 'google/gemma-3-27b-it-qat-q4_0-gguf'.
(APIServer pid=1) Please verify the following requirements:
(APIServer pid=1) 1. Provide a valid Hugging Face repository ID.
(APIServer pid=1) 2. Specify a local directory that contains a recognized configuration file.
(APIServer pid=1) - For Hugging Face models: ensure the presence of a 'config.json'.
(APIServer pid=1) - For Mistral models: ensure the presence of a 'params.json'.
(APIServer pid=1) [type=value_error, input_value=ArgsKwargs((), {'model': …rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=1) For further information visit Redirecting...

The error is caused by passing google/gemma-3-27b-it-qat-q4_0-gguf as the model argument: that repository contains only GGUF files and no config.json, so vLLM rejects it as a model source. For GGUF models, you must provide the local path to the .gguf file, not the Hugging Face repo name. Also, your command contains non-ASCII curly quotes (“ and ”) around environment variable values; replace them with standard ASCII quotes.

To fix: download the GGUF file to your host, mount the containing directory into the container, and use the full local path to the .gguf file as the model argument (e.g., /root/.cache/huggingface/gemma-3-27b-it-qat-q4_0-gguf.gguf). See the error message and discussion in vllm issue #22164 and vllm issue #14696. The command should look like:

docker run ... \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  rocm/vllm:latest \
  vllm serve /root/.cache/huggingface/gemma-3-27b-it-qat-q4_0-gguf.gguf \
  ...

Would you like a step-by-step breakdown or help with downloading the GGUF file?

Hint: Mention @RunLLM in the post for followups.

yes

yes give detailed instructions

Certainly! Here are detailed, step-by-step instructions for running a GGUF model (like Gemma-3-27B-IT-QAT-Q4_0-GGUF) with the latest vLLM ROCm Docker image:


1. Download the GGUF Model File

You must have the .gguf file for your model on your local machine. You can download it from Hugging Face or another source. For example, if the file is named gemma-3-27b-it-qat-q4_0-gguf.gguf, place it in a directory such as /home/ubuntu/vllm_models.
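As a concrete sketch of this step (the repo name mirrors the one above; huggingface-cli comes from `pip install -U huggingface_hub`, and the download is skipped gracefully if the CLI is not installed):

```shell
# Host directory that will later be mounted into the container.
MODEL_DIR="${MODEL_DIR:-$HOME/vllm_models}"
mkdir -p "$MODEL_DIR"

# Download only the .gguf file(s) from the repo. Gated models like Gemma also
# need a valid token, configured via `huggingface-cli login` or
# HUGGING_FACE_HUB_TOKEN in the environment.
if command -v huggingface-cli >/dev/null 2>&1; then
  huggingface-cli download google/gemma-3-27b-it-qat-q4_0-gguf \
    --include "*.gguf" \
    --local-dir "$MODEL_DIR"
fi
echo "model directory: $MODEL_DIR"
```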


2. Mount the Model Directory in Docker

When running the Docker container, mount the directory containing your .gguf file to a path inside the container (commonly /root/.cache/huggingface). This makes the file accessible to vLLM inside the container.
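A quick way to verify the mount before starting the server (a sketch; the listing only runs when Docker and the image are already available locally, so the snippet is safe to paste anywhere):

```shell
# List the mounted directory from inside the container to confirm the .gguf
# file is visible at the path vLLM will be given.
if command -v docker >/dev/null 2>&1 \
   && docker image inspect rocm/vllm:latest >/dev/null 2>&1; then
  docker run --rm \
    -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
    --entrypoint ls rocm/vllm:latest -lh /root/.cache/huggingface
fi
MOUNT_CHECK=done
echo "mount check: $MOUNT_CHECK"
```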


3. Use the Full Path to the GGUF File

In your vllm serve command, specify the full path to the .gguf file inside the container (e.g., /root/.cache/huggingface/gemma-3-27b-it-qat-q4_0-gguf.gguf). Do not use the Hugging Face repo name as the model argument.
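To make the mapping explicit (paths are illustrative and must match your actual -v flag and filename):

```shell
# The -v flag maps a host directory onto a container directory; the model
# argument given to `vllm serve` must use the container-side path.
HOST_DIR=/home/ubuntu/vllm_models
CONTAINER_DIR=/root/.cache/huggingface
GGUF_FILE=gemma-3-27b-it-qat-q4_0-gguf.gguf

echo "on the host:          $HOST_DIR/$GGUF_FILE"
echo "inside the container: $CONTAINER_DIR/$GGUF_FILE"
```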


4. Correct Environment Variables and Quotes

Make sure all environment variables use standard ASCII quotes ("), not curly or non-ASCII quotes. For example:
-e HUGGING_FACE_HUB_TOKEN=""
Also, ensure the image name is correct: rocm/vllm:latest (no space).
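Curly quotes are easy to miss by eye; a byte-level grep catches them. The snippet below writes a two-line sample fragment (one bad line with curly quotes, one clean line) and flags any character outside the printable ASCII range:

```shell
# Sample launch fragment: the first line uses curly quotes, the second is clean.
cat > /tmp/launch_fragment.txt <<'EOF'
-e HUGGING_FACE_HUB_TOKEN=“”
-e PYTORCH_ROCM_ARCH="gfx1100"
EOF

# Flag any byte outside printable ASCII (space through tilde), with line numbers.
LC_ALL=C grep -n '[^ -~]' /tmp/launch_fragment.txt \
  || echo "no non-ASCII characters found"
```

Run the same grep over your saved docker command before launching it.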


5. Example Command

Here is a corrected example command:

docker run -it \
  --dns=192.168.49.1 \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  --shm-size=16g \
  -e HUGGING_FACE_HUB_TOKEN="" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e ROCM_VISIBLE_DEVICES=0 \
  -e HIP_VISIBLE_DEVICES=0 \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -e PYTORCH_ROCM_ARCH="gfx1100" \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e GPU_MAX_HW_QUEUES=1 \
  -e NCCL_DEBUG=WARN \
  -e NCCL_IB_DISABLE=1 \
  --name vllm_rocm_gemma3 \
  -v /home/ubuntu/vllm_models:/root/.cache/huggingface \
  rocm/vllm:latest \
  vllm serve /root/.cache/huggingface/gemma-3-27b-it-qat-q4_0-gguf.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --enforce-eager \
    --served-model-name gemma3 \
    --dtype auto \
    --kv-cache-dtype auto \
    --max-model-len 2048 \
    --max-num-seqs 2 \
    --max-num-batched-tokens 2048 \
    --gpu-memory-utilization 0.93 \
    --swap-space 8 \
    --disable-log-requests \
    --disable-log-stats \
    --max-log-len 100
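Once the server reports it is ready, a smoke test against the OpenAI-compatible API confirms the port and served model name (the chat request below is only sent when something is actually listening on localhost:8000):

```shell
# Probe the models endpoint first; only send a chat request if the server is up.
if curl -sf --max-time 2 http://localhost:8000/v1/models >/dev/null 2>&1; then
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gemma3", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'
else
  echo "server not reachable on localhost:8000 (start it first)"
fi
SMOKE_CHECK=done
```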

6. Notes and Troubleshooting

  • If your model is split into multiple .gguf files, merge them into a single file with llama.cpp's gguf-split tool before running vLLM; vLLM expects one file.
  • Always use the tokenizer from the base model (e.g., add --tokenizer google/gemma-3-27b-it if needed).
  • If you get errors about missing config files, double-check the path and filename.
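For the multi-part case in the first bullet, a hypothetical sketch, assuming llama.cpp is built and its llama-gguf-split binary is on PATH (the shard and output names below are illustrative and must match your actual files):

```shell
# Merge multi-part GGUF shards (named like ...-00001-of-00003.gguf) into one
# file. Skipped when the tool or the first shard is not present.
FIRST_SHARD="$HOME/vllm_models/gemma-3-27b-it-qat-q4_0-gguf-00001-of-00003.gguf"
MERGED="$HOME/vllm_models/gemma-3-27b-it-qat-q4_0-gguf.gguf"

if command -v llama-gguf-split >/dev/null 2>&1 && [ -f "$FIRST_SHARD" ]; then
  llama-gguf-split --merge "$FIRST_SHARD" "$MERGED"
fi
echo "merge target: $MERGED"
```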

Would you like instructions for merging multi-part GGUF files or for using a custom tokenizer?
