I have been trying to install vLLM on Ubuntu 24.04 for 3 full days. I have two 7900 XTX cards, but it's incredibly hard. Your instructions are full of crap.
Why don't you make an easy step-by-step instruction for running vLLM? AMD loses so much profit.
For example, is this the correct way to run Docker?
Your docker command is mostly correct, but there are some issues with the dash characters (they appear as “â” instead of “--”) and with the quotes around PYTORCH_ROCM_ARCH. The recommended way to run vLLM on dual 7900 XTX (gfx1100) is to use the official ROCm vLLM docker image, with proper device and environment variable settings. Here is a step-by-step example based on the official docs:
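A minimal sketch of that setup, assuming your models live in the Hugging Face cache under your home directory (adjust the mount path to your machine):

docker pull rocm/vllm-dev:navi_base

# expose both GPUs through the ROCm device nodes and drop into a shell in the container
docker run -it --rm \
  --network=host \
  --ipc=host \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  rocm/vllm-dev:navi_base \
  bash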
You do not need to set CUDA_VISIBLE_DEVICES for AMD cards. For 7900 XTX, use the rocm/vllm-dev:navi_base image for best compatibility. See official ROCm install instructions and docker usage for more details.
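Once inside the container, serving a model across both cards is typically a single command; a rough sketch (the model name and port below are placeholders):

# --tensor-parallel-size 2 splits the model across the two 7900 XTX GPUs
vllm serve <your-model-name> \
  --tensor-parallel-size 2 \
  --port 8000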
Would you like a full step-by-step guide, including model serving?
INFO 07-24 08:55:18 [default_loader.py:272] Loading weights took 5.30 seconds
INFO 07-24 08:55:18 [gpu_model_runner.py:1782] Model loading took 12.8730 GiB and 73.378016 seconds
INFO 07-24 08:55:26 [backends.py:509] Using cache directory: /root/.cache/vllm/torch_compile_cache/27e5fbe160/rank_0_0/backbone for vLLM’s torch.compile
INFO 07-24 08:55:26 [backends.py:520] Dynamo bytecode transform time: 7.63 s
[rank0]:[W724 08:55:48.554045210 Tunable.cpp:366] Warning: Failed validator: GCN_ARCH_NAME (function ValidateAll)
The warning [W724 ... Tunable.cpp:366] Warning: Failed validator: GCN_ARCH_NAME (function ValidateAll) indicates that the ROCm stack could not validate the GPU architecture name, which may affect performance or compatibility. This is a ROCm/PyTorch warning and not a fatal error for vLLM; your model should still run if other steps succeed. For 7900 XTX (gfx1100), ensure your ROCm, PyTorch, and vLLM versions are compatible and that the correct PYTORCH_ROCM_ARCH is set to gfx1100 as per official ROCm install instructions.
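If you only want to silence it for now, one option is to set these environment variables before starting the server; this assumes the warning comes from PyTorch's TunableOp validating a results file that was recorded for a different GPU architecture:

# disable TunableOp so the stale results file is not validated (assumption: the warning originates there)
export PYTORCH_TUNABLEOP_ENABLED=0
# set the target architecture explicitly, as per the ROCm instructions above
export PYTORCH_ROCM_ARCH=gfx1100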
Would you like troubleshooting steps to resolve or suppress this warning?
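To check that the server is actually responding, you can send a simple completion request like this (the prompt and parameters are just illustrative):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'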
Replace <your-model-name> with the actual model you loaded. This will return a completion if the server is running correctly. See official quickstart for more details.
Would you like a Python example or help with chat completions?
The error ValueError: google/gemma-3n-E4B-it is not a multimodal model means you tried to send an image (or multimodal) input to a model that only supports text. The Gemma-3n-E4B-it model is not a vision/multimodal model, so it cannot process images—only text prompts are supported. This is expected behavior as per supported models documentation.
To fix:
Only send text in your chat/completions request for this model.
If you want to use images, choose a model with vision/multimodal support (see the “T + I” column in the supported models list).
Would you like an example of a valid text-only request?
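For reference, a text-only chat request could look like the sketch below; the model name here is assumed to be the full repo name, so substitute whatever name your server is actually serving:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3n-E4B-it",
    "messages": [
      {"role": "user", "content": "Write one sentence about San Francisco."}
    ],
    "max_tokens": 64
  }'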
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3n-E4B-it",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
{"object":"error","message":"The model gemma-3n-E4B-it does not exist.","type":"NotFoundError","param":null,"code":404}
The error The model gemma-3n-E4B-it does not exist. means the model name in your request does not match the name the server is serving. When using a local path or custom model, you must use the exact name specified by the --served-model-name argument (or the default, which is usually the path or repo name). Also make sure your curl command uses plain straight quotes (") rather than the curly “ ” characters that forums and word processors often substitute, since those break the JSON body.
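To see the exact name(s) the server is exposing, you can query the models endpoint:

curl http://localhost:8000/v1/models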
Yes, vLLM’s server is OpenAI API compatible for endpoints like /v1/completions, /v1/chat/completions, and /v1/embeddings, so you can use it as a drop-in replacement for OpenAI API clients and libraries, including the official OpenAI Python client. See official quickstart and OpenAI-compatible server docs.
Would you like details on supported endpoints or usage examples?