Make install easier

I have tried to install vLLM on Ubuntu 24.04 for 3 full days. I have two 7900 XTX cards, but it's incredibly hard. Your instructions are full of crap.
Why don't you make an easy step-by-step instruction for running vLLM? AMD loses so much profit.

For example, is this the correct way to run Docker?

docker run -it
–dns=8.8.8.8
–network=host
–group-add=video
–ipc=host
–cap-add=SYS_PTRACE
–security-opt seccomp=unconfined
–privileged
–device /dev/kfd
–device /dev/dri
-e ROCM_VISIBLE_DEVICES=0,1,2,3
-e HIP_VISIBLE_DEVICES=0,1,2,3
-e CUDA_VISIBLE_DEVICES=0,1,2,3
-e VLLM_USE_TRITON_FLASH_ATTN=0
-e PYTORCH_TUNABLEOP_ENABLED=1
-e HSA_OVERRIDE_GFX_VERSION=11.0.0
-e PYTORCH_ROCM_ARCH=“gfx1100”
-e GPU_MAX_HW_QUEUES=1
-v /home/ubuntu/vllm_models:/workspace/models
rocm/vllm:latest bash

Your docker command is mostly correct, but there are some issues with the dash characters (the flag prefixes appear as "–", an en dash, instead of "--", two hyphens) and with the curly quotes around the PYTORCH_ROCM_ARCH value. The recommended way to run vLLM on dual 7900 XTX (gfx1100) cards is to use the official ROCm vLLM Docker image with the proper device and environment variable settings. Here is a step-by-step example based on the official docs:

docker run -it \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v /home/ubuntu/vllm_models:/app/model \
  rocm/vllm:latest \
  bash

You do not need to set CUDA_VISIBLE_DEVICES for AMD cards. For 7900 XTX, use the rocm/vllm-dev:navi_base image for best compatibility. See official ROCm install instructions and docker usage for more details.
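
If you switch to the Navi-specific image mentioned above, the same launch might look like this. This is a sketch rather than an official recipe: the HIP_VISIBLE_DEVICES and PYTORCH_ROCM_ARCH values are carried over from your original command, trimmed to your two GPUs.

docker run -it \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -e HIP_VISIBLE_DEVICES=0,1 \
  -e PYTORCH_ROCM_ARCH=gfx1100 \
  -v /home/ubuntu/vllm_models:/app/model \
  rocm/vllm-dev:navi_base \
  bash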

Would you like a full step-by-step guide, including model serving?

INFO 07-24 08:55:18 [default_loader.py:272] Loading weights took 5.30 seconds
INFO 07-24 08:55:18 [gpu_model_runner.py:1782] Model loading took 12.8730 GiB and 73.378016 seconds
INFO 07-24 08:55:26 [backends.py:509] Using cache directory: /root/.cache/vllm/torch_compile_cache/27e5fbe160/rank_0_0/backbone for vLLM's torch.compile
INFO 07-24 08:55:26 [backends.py:520] Dynamo bytecode transform time: 7.63 s
[rank0]:[W724 08:55:48.554045210 Tunable.cpp:366] Warning: Failed validator: GCN_ARCH_NAME (function ValidateAll)

The warning [W724 ... Tunable.cpp:366] Warning: Failed validator: GCN_ARCH_NAME (function ValidateAll) indicates that the ROCm stack could not validate the GPU architecture name, which may affect performance or compatibility. This is a ROCm/PyTorch warning and not a fatal error for vLLM; your model should still run if other steps succeed. For 7900 XTX (gfx1100), ensure your ROCm, PyTorch, and vLLM versions are compatible and that the correct PYTORCH_ROCM_ARCH is set to gfx1100 as per official ROCm install instructions.
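
If you want to check or override the architecture, here is a minimal sketch to run inside the container (the rocminfo check and the option of disabling TunableOp to quiet the validator are suggestions, not steps from the official docs):

rocminfo | grep -i gfx              # should report gfx1100 for a 7900 XTX
export PYTORCH_ROCM_ARCH=gfx1100    # match the runtime arch to your cards
export PYTORCH_TUNABLEOP_ENABLED=0  # optional: disabling TunableOp usually silences this validator warning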

Would you like troubleshooting steps to resolve or suppress this warning?

So now I got:
INFO: Started server process [295]
INFO: Waiting for application startup.
INFO: Application startup complete.

How do I test the model?
It's running in Docker.

To test your vLLM model running in Docker, open a new terminal on your host and send a request to the server with curl, for example:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'

Replace <your-model-name> with the actual model you loaded. This will return a completion if the server is running correctly. See official quickstart for more details.
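
If you first want to confirm the server is reachable from the host (the container was started with --network=host, so localhost works), a quick sketch:

curl -i http://localhost:8000/health     # expect HTTP/1.1 200 OK once startup is complete
curl http://localhost:8000/v1/models     # lists the served model name(s) to use in the "model" field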

Would you like a Python example or help with chat completions?

INFO 07-24 09:00:29 [chat_utils.py:421] Detected the chat template content format to be 'openai'. You can set --chat-template-content-format to override this.
ERROR 07-24 09:00:29 [serving_chat.py:203] Error in preprocessing prompt inputs
ERROR 07-24 09:00:29 [serving_chat.py:203] Traceback (most recent call last):
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 186, in create_chat_completion
ERROR 07-24 09:00:29 [serving_chat.py:203] ) = await self._preprocess_chat(
ERROR 07-24 09:00:29 [serving_chat.py:203] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 789, in _preprocess_chat
ERROR 07-24 09:00:29 [serving_chat.py:203] conversation, mm_data_future = parse_chat_messages_futures(
ERROR 07-24 09:00:29 [serving_chat.py:203] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 1200, in parse_chat_messages_futures
ERROR 07-24 09:00:29 [serving_chat.py:203] sub_messages = _parse_chat_message_content(
ERROR 07-24 09:00:29 [serving_chat.py:203] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 1124, in _parse_chat_message_content
ERROR 07-24 09:00:29 [serving_chat.py:203] result = _parse_chat_message_content_parts(
ERROR 07-24 09:00:29 [serving_chat.py:203] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 1024, in _parse_chat_message_content_parts
ERROR 07-24 09:00:29 [serving_chat.py:203] parse_res = _parse_chat_message_content_part(
ERROR 07-24 09:00:29 [serving_chat.py:203] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 1081, in _parse_chat_message_content_part
ERROR 07-24 09:00:29 [serving_chat.py:203] mm_parser.parse_image(str_content)
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 781, in parse_image
ERROR 07-24 09:00:29 [serving_chat.py:203] placeholder = self._tracker.add("image", image_coro)
ERROR 07-24 09:00:29 [serving_chat.py:203] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 590, in add
ERROR 07-24 09:00:29 [serving_chat.py:203] mm_processor = mm_registry.create_processor(model_config)
ERROR 07-24 09:00:29 [serving_chat.py:203] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 09:00:29 [serving_chat.py:203] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 267, in create_processor
ERROR 07-24 09:00:29 [serving_chat.py:203] raise ValueError(f"{model_config.model} is not a multimodal model")
ERROR 07-24 09:00:29 [serving_chat.py:203] ValueError: google/gemma-3n-E4B-it is not a multimodal model
/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py:204: RuntimeWarning: coroutine 'MediaConnector.fetch_image_async' was never awaited
return self.create_error_response(f"{e} {e.cause}")
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO: 127.0.0.1:43962 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

The error ValueError: google/gemma-3n-E4B-it is not a multimodal model means you tried to send an image (or other multimodal) input to a model that only supports text. The Gemma-3n-E4B-it model is not a vision/multimodal model, so it cannot process images; only text prompts are supported. This is expected behavior per the supported models documentation.

To fix:

  • Only send text in your chat/completions request for this model (see the sketch after this list).
  • If you want to use images, choose a model with vision/multimodal support (see the “T + I” column in the supported models list).
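
For example, a valid text-only chat request could look like this (a sketch; the model name google/gemma-3n-E4B-it is assumed to be the name your server reports under /v1/models):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3n-E4B-it",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ],
    "max_tokens": 32
  }'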

Would you like more help adjusting your requests for this model?

curl http://localhost:8000/v1/completions
-H “Content-Type: application/json”
-d ‘{
“model”: “gemma-3n-E4B-it”,
“prompt”: “San Francisco is a”,
“max_tokens”: 7,
“temperature”: 0
}’
{“object”:“error”,“message”:“The model gemma-3n-E4B-it does not exist.”,“type”:“NotFoundError”,“param”:null,“code”:404}ubuntu@ubuntttu:~$

The error The model gemma-3n-E4B-it does not exist. means the model name in your request does not match the name the server is serving. When using a local path or custom model, you must use the exact name specified by the --served-model-name argument (or the default, which is usually the path or repo name). Also, your curl command uses typographic quote characters: replace every " and " with a straight ", and ' ' with a straight '.

Example (with straight quotes and, most likely, the full served model name from the error trace above):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3n-E4B-it",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'

To find the correct model name, run:

curl http://localhost:8000/v1/models

and use the "id" field from the response as your model name. See official quickstart for details.
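
For instance, assuming jq is available on the host (otherwise just read the "id" field in the JSON by eye):

curl -s http://localhost:8000/v1/models | jq -r '.data[].id'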

Would you like more help matching the model name?

OK, now it worked. Is this OpenAI API compatible?

Yes, vLLM’s server is OpenAI API compatible for endpoints like /v1/completions, /v1/chat/completions, and /v1/embeddings, so you can use it as a drop-in replacement for OpenAI API clients and libraries, including the official OpenAI Python client. See official quickstart and OpenAI-compatible server docs.
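
As one illustration of that drop-in compatibility, here is a sketch assuming a recent openai-python (v1.x) client, which reads these environment variables, and a server started without --api-key (so any placeholder key is accepted):

export OPENAI_BASE_URL=http://localhost:8000/v1   # point OpenAI clients at your local vLLM server
export OPENAI_API_KEY=dummy                       # any non-empty value works when no --api-key is set

With those set, existing scripts that call the OpenAI completions or chat completions API should hit your local server instead of api.openai.com.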

Would you like details on supported endpoints or usage examples?
