How to set up the AMD GPU as the default in a mixed NVIDIA + AMD machine?

Hi

I’m having issues with vLLM because my machine has two GPUs from different vendors: one NVIDIA and one AMD. The NVIDIA card is an older one, but I may use it as an accelerator for speculative decoding alongside a larger model on the AMD GPU. However, I can’t launch models on the AMD GPU by default, because vLLM seems to detect that CUDA is also present on the machine and always loads models onto the NVIDIA card. How can I make vLLM launch models on the AMD GPU by default?

By the way, I have the latest CUDA and ROCm versions installed. I have tested both with Ollama, PyTorch, and Docker, and both work fine most of the time.

Woah, you have a very interesting setup.

Can you try the following steps to check whether a vLLM build with ROCm support is able to use the AMD GPU?
Step 1:

docker run -it --rm \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mem \
    --group-add render \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    rocm/vllm-dev:main

Step 2:

vllm serve <yourmodel>

And what is the order of the GPUs?
Is the first GPU the NVIDIA one and the second GPU the AMD one?

hi @tjtanaa

Yes, the first one is NVIDIA and the second one is AMD. So the indices are NVIDIA == 0, AMD == 1.

So in vllm serve, should I put 1 in place of <yourmodel>?

I’ll try the Docker approach, but is there any way to run it in a Python virtual environment? I’m not sure whether there is a flag to select the AMD GPU first.

The main goal is to launch two models: the smaller one on my NVIDIA GPU (it’s Pascal-based with 4 GB of VRAM, so a small model could be fine for speculative decoding) and the bigger one on the AMD GPU. I’m not sure whether the Docker approach would let me do that. Is it possible? I also tried GPUStack, and that works with both GPUs, but it splits one model across two GPUs rather than doing speculative decoding. So what I really want is to first load the bigger model on the AMD GPU, and then launch a smaller model on the NVIDIA GPU in speculative decoding mode. If that’s not possible, I’d like to do the same using only my AMD GPU, launching both models on the same GPU for speculative decoding.

In any case, thanks for your help :)

@CarlosR759
Based on my understanding, with vLLM the draft model used for speculative decoding needs to share the same GPU as the large model. There isn’t a configuration that allows the draft model to be on one GPU and the base model on another.

Descriptively, the following is possible:

  1. Base model with TP2, plus a draft model:

     GPU 0                      GPU 1
     Base Model (first half)    Base Model (second half)
     Draft Model

  2. The following is not possible yet:

     GPU 0          GPU 1
     Draft Model    Base Model

Moreover, the draft model in speculative decoding currently needs to be run without tensor parallelism, meaning draft_tensor_parallel_size should be set to 1.
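As a rough sketch of the single-GPU fallback (base model and draft model on the same device), the serve command could be assembled as below. This assumes a vLLM release whose vllm serve accepts --speculative-config as a JSON string; older releases used separate speculative-decoding flags, so check vllm serve --help for your version. The helper names and the model names in the usage note are placeholders, not something from this thread.

```python
import json
import shlex


def build_serve_command(base_model, draft_model, num_speculative_tokens=5,
                        tensor_parallel_size=1):
    """Build a `vllm serve` argv where the draft model shares the base model's GPU(s)."""
    speculative_config = {
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
        # The draft model is run without tensor parallelism.
        "draft_tensor_parallel_size": 1,
    }
    return [
        "vllm", "serve", base_model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Assumption: this flag is available in the installed vLLM version.
        "--speculative-config", json.dumps(speculative_config),
    ]


def as_shell(argv):
    """Render the argv as a single, safely quoted shell command line."""
    return shlex.join(argv)
```

For example, print(as_shell(build_serve_command("<base-model>", "<draft-model>"))) prints a command line that can be pasted into the shell after replacing the placeholders.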


Thank you so much for the information. I’ll try to do that after I fix the current issue.

I tried this with the DeepSeek R1 Qwen 1.5B distillation, but it failed like this:

config.json: 100%|█████████████████████████████████████████████████████| 679/679 [00:00<00:00, 9.13MB/s]
INFO 04-11 00:00:59 [__init__.py:256] Automatically detected platform rocm.
INFO 04-11 00:01:06 [config.py:578] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 04-11 00:01:08 [config.py:1508] Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.
INFO 04-11 00:01:08 [config.py:1520] Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.
WARNING 04-11 00:01:08 [arg_utils.py:1282] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 04-11 00:01:08 [rocm.py:228] Aiter main switch (VLLM_USE_AITER) is not set. Disabling individual Aiter components
INFO 04-11 00:01:08 [config.py:578] This model supports multiple tasks: {'classify', 'generate', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
tokenizer_config.json: 100%|███████████████████████████████████████| 3.07k/3.07k [00:00<00:00, 35.0MB/s]
tokenizer.json:   0%|                                                       | 0.00/7.03M [00:00<?, ?B/s]INFO 04-11 00:01:10 [config.py:1508] Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.
INFO 04-11 00:01:10 [config.py:1520] Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.
WARNING 04-11 00:01:10 [arg_utils.py:1282] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 04-11 00:01:10 [rocm.py:228] Aiter main switch (VLLM_USE_AITER) is not set. Disabling individual Aiter components
INFO 04-11 00:01:10 [engine.py:77] Initializing a V0 LLM engine (v0.7.4.dev388+g51641aaa7) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
tokenizer.json: 100%|██████████████████████████████████████████████| 7.03M/7.03M [00:01<00:00, 5.97MB/s]
generation_config.json: 100%|███████████████████████████████████████████| 181/181 [00:00<00:00, 664kB/s]
INFO 04-11 00:01:13 [rocm.py:133] None is not supported in AMD GPUs.
INFO 04-11 00:01:13 [rocm.py:134] Using ROCmFlashAttention backend.
[rank0]:[W411 00:01:13.369539287 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-11 00:01:13 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-11 00:01:13 [model_runner.py:1115] Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
WARNING 04-11 00:01:13 [rocm.py:239] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
ERROR 04-11 00:01:14 [engine.py:411] HIP error: invalid device function
ERROR 04-11 00:01:14 [engine.py:411] HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-11 00:01:14 [engine.py:411] For debugging consider passing AMD_SERIALIZE_KERNEL=3
ERROR 04-11 00:01:14 [engine.py:411] Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
ERROR 04-11 00:01:14 [engine.py:411] Traceback (most recent call last):
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
ERROR 04-11 00:01:14 [engine.py:411]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 04-11 00:01:14 [engine.py:411]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 125, in from_engine_args
ERROR 04-11 00:01:14 [engine.py:411]     return cls(ipc_path=ipc_path,
ERROR 04-11 00:01:14 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 77, in __init__
ERROR 04-11 00:01:14 [engine.py:411]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-11 00:01:14 [engine.py:411]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "vllm/engine/llm_engine.py", line 274, in vllm.engine.llm_engine.LLMEngine.__init__
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-11 00:01:14 [engine.py:411]     self._init_executor()
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 04-11 00:01:14 [engine.py:411]     self.collective_rpc("load_model")
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-11 00:01:14 [engine.py:411]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-11 00:01:14 [engine.py:411]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2444, in run_method
ERROR 04-11 00:01:14 [engine.py:411]     return func(*args, **kwargs)
ERROR 04-11 00:01:14 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 211, in load_model
ERROR 04-11 00:01:14 [engine.py:411]     self.model_runner.load_model()
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1118, in load_model
ERROR 04-11 00:01:14 [engine.py:411]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-11 00:01:14 [engine.py:411]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-11 00:01:14 [engine.py:411]     return loader.load_model(vllm_config=vllm_config)
ERROR 04-11 00:01:14 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 423, in load_model
ERROR 04-11 00:01:14 [engine.py:411]     model = _initialize_model(vllm_config=vllm_config)
ERROR 04-11 00:01:14 [engine.py:411]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
ERROR 04-11 00:01:14 [engine.py:411]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-11 00:01:14 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 431, in __init__
ERROR 04-11 00:01:14 [engine.py:411]     self.model = Qwen2Model(vllm_config=vllm_config,
ERROR 04-11 00:01:14 [engine.py:411]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-11 00:01:14 [engine.py:411]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 300, in __init__
ERROR 04-11 00:01:14 [engine.py:411]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-11 00:01:14 [engine.py:411]                                                     ^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 558, in make_layers
ERROR 04-11 00:01:14 [engine.py:411]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-11 00:01:14 [engine.py:411]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
ERROR 04-11 00:01:14 [engine.py:411]     lambda prefix: Qwen2DecoderLayer(config=config,
ERROR 04-11 00:01:14 [engine.py:411]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in __init__
ERROR 04-11 00:01:14 [engine.py:411]     self.self_attn = Qwen2Attention(
ERROR 04-11 00:01:14 [engine.py:411]                      ^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in __init__
ERROR 04-11 00:01:14 [engine.py:411]     self.rotary_emb = get_rope(
ERROR 04-11 00:01:14 [engine.py:411]                       ^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 1111, in get_rope
ERROR 04-11 00:01:14 [engine.py:411]     rotary_emb = RotaryEmbedding(head_size, rotary_dim, max_position, base,
ERROR 04-11 00:01:14 [engine.py:411]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 99, in __init__
ERROR 04-11 00:01:14 [engine.py:411]     cache = self._compute_cos_sin_cache()
ERROR 04-11 00:01:14 [engine.py:411]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 116, in _compute_cos_sin_cache
ERROR 04-11 00:01:14 [engine.py:411]     inv_freq = self._compute_inv_freq(self.base)
ERROR 04-11 00:01:14 [engine.py:411]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 110, in _compute_inv_freq
ERROR 04-11 00:01:14 [engine.py:411]     inv_freq = 1.0 / (base**(torch.arange(
ERROR 04-11 00:01:14 [engine.py:411]                              ^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
ERROR 04-11 00:01:14 [engine.py:411]     return func(*args, **kwargs)
ERROR 04-11 00:01:14 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 00:01:14 [engine.py:411] RuntimeError: HIP error: invalid device function
ERROR 04-11 00:01:14 [engine.py:411] HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-11 00:01:14 [engine.py:411] For debugging consider passing AMD_SERIALIZE_KERNEL=3
ERROR 04-11 00:01:14 [engine.py:411] Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
ERROR 04-11 00:01:14 [engine.py:411]
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 413, in run_mp_engine
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 125, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 77, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/engine/llm_engine.py", line 274, in vllm.engine.llm_engine.LLMEngine.__init__
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
    self.collective_rpc("load_model")
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2444, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 211, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1118, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 423, in load_model
    model = _initialize_model(vllm_config=vllm_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 431, in __init__
    self.model = Qwen2Model(vllm_config=vllm_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 300, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
                                                    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 558, in make_layers
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
    lambda prefix: Qwen2DecoderLayer(config=config,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in __init__
    self.self_attn = Qwen2Attention(
                     ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in __init__
    self.rotary_emb = get_rope(
                      ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 1111, in get_rope
    rotary_emb = RotaryEmbedding(head_size, rotary_dim, max_position, base,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 99, in __init__
    cache = self._compute_cos_sin_cache()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 116, in _compute_cos_sin_cache
    inv_freq = self._compute_inv_freq(self.base)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 110, in _compute_inv_freq
    inv_freq = 1.0 / (base**(torch.arange(
                             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

[rank0]:[W411 00:01:14.958224533 ProcessGroupNCCL.cpp:1505] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 33, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 947, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 233, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

As you can see, the log says:

Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.

I’m not sure whether my NVIDIA card is what is being detected in this message. There is also an issue with HIP, which I hit as well when launching some applications with PyTorch alone. What do you think the error could be? In the Arch Wiki you can see the HIP packages available for ROCm.

I currently have only the rocm-hip-runtime package installed, and I’m not sure whether I also need hip-runtime-nvidia. My GPU is the RX 7600 XT. I’m not sure whether it is supported by vLLM: the documentation says the 7900 XT series is supported, and since both cards are RDNA 3, I’m giving it a try.

What do you think could be the issue?
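One thing worth checking, offered as a possibility rather than a confirmed diagnosis: the RX 7600 XT reports the architecture gfx1102, while the RX 7900 series reports gfx1100, and "HIP error: invalid device function" is often seen when the installed binaries contain no kernels for the GPU's architecture. The sketch below extracts the gfx names from rocminfo output so they can be compared with the targets a build supports. The function names are illustrative.

```python
import re
import shutil
import subprocess


def gfx_targets(rocminfo_text):
    """Return the distinct gfx architecture names found in rocminfo output, in order."""
    seen = []
    for name in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def local_gfx_targets():
    """Run rocminfo if it is installed; return None when it is not on PATH."""
    if shutil.which("rocminfo") is None:
        return None
    result = subprocess.run(["rocminfo"], capture_output=True, text=True)
    return gfx_targets(result.stdout)
```

If this reports gfx1102, a workaround some users apply is exporting HSA_OVERRIDE_GFX_VERSION=11.0.0 before starting vLLM, so that the runtime treats the card as gfx1100. Whether that is sufficient for this particular image is untested here.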

@CarlosR759 If your AMD GPU is the second GPU, can you try setting the environment variable HIP_VISIBLE_DEVICES=1?
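Note that HIP_VISIBLE_DEVICES indexes only the devices the HIP runtime itself enumerates, so on a ROCm build the NVIDIA card is not counted and the AMD GPU may be index 0 rather than 1, especially inside a container that only receives /dev/kfd and /dev/dri. A minimal sketch for probing which index works, assuming a ROCm build of PyTorch is installed in the same environment (the helper names are illustrative):

```python
import os
import subprocess
import sys


def hip_env(device_indices, base_env=None):
    """Copy the environment and restrict HIP to the given device indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_indices)
    return env


def visible_gpu_count(env):
    """Count the GPUs PyTorch sees under env; None if the probe fails (e.g. no torch)."""
    probe = "import torch; print(torch.cuda.device_count())"
    # A fresh process is used because the variable is read when the runtime initializes.
    result = subprocess.run([sys.executable, "-c", probe], env=env,
                            capture_output=True, text=True)
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    try:
        return int(lines[-1])
    except ValueError:
        return None
```

Calling visible_gpu_count(hip_env([0])) and visible_gpu_count(hip_env([1])) shows which index actually exposes a GPU; the index that returns 1 is the one to export before running vllm serve.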

Do you mean that after launching the Docker container, since it gives me shell access by default, I just need to run export HIP_VISIBLE_DEVICES=1?

If that is what you mean, I did that, and this is the container log when I try to launch the same model as in my previous attempt:

INFO 04-21 18:46:33 [api_server.py:209] Started engine process with PID 45
config.json: 100%|█████████████████████████████████████████████████████| 679/679 [00:00<00:00, 7.93MB/s]
INFO 04-21 18:46:35 [__init__.py:256] Automatically detected platform rocm.
ERROR 04-21 18:46:40 [registry.py:330] Error in inspecting model architecture 'Qwen2ForCausalLM'
ERROR 04-21 18:46:40 [registry.py:330] Traceback (most recent call last):
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 553, in _run_in_subprocess
ERROR 04-21 18:46:40 [registry.py:330]     returned.check_returncode()
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/lib/python3.12/subprocess.py", line 504, in check_returncode
ERROR 04-21 18:46:40 [registry.py:330]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 04-21 18:46:40 [registry.py:330] subprocess.CalledProcessError: Command '['/usr/bin/python3.12', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 04-21 18:46:40 [registry.py:330]
ERROR 04-21 18:46:40 [registry.py:330] The above exception was the direct cause of the following exception:
ERROR 04-21 18:46:40 [registry.py:330]
ERROR 04-21 18:46:40 [registry.py:330] Traceback (most recent call last):
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 328, in _try_inspect_model_cls
ERROR 04-21 18:46:40 [registry.py:330]     return model.inspect_model_cls()
ERROR 04-21 18:46:40 [registry.py:330]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 299, in inspect_model_cls
ERROR 04-21 18:46:40 [registry.py:330]     return _run_in_subprocess(
ERROR 04-21 18:46:40 [registry.py:330]            ^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 556, in _run_in_subprocess
ERROR 04-21 18:46:40 [registry.py:330]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 04-21 18:46:40 [registry.py:330] RuntimeError: Error raised in subprocess:
ERROR 04-21 18:46:40 [registry.py:330] <frozen runpy>:128: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 04-21 18:46:40 [registry.py:330] Traceback (most recent call last):
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen runpy>", line 198, in _run_module_as_main
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen runpy>", line 88, in _run_code
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 577, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     _run()
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 570, in _run
ERROR 04-21 18:46:40 [registry.py:330]     result = fn()
ERROR 04-21 18:46:40 [registry.py:330]              ^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 300, in <lambda>
ERROR 04-21 18:46:40 [registry.py:330]     lambda: _ModelInfo.from_model_cls(self.load_model_cls()))
ERROR 04-21 18:46:40 [registry.py:330]                                       ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 303, in load_model_cls
ERROR 04-21 18:46:40 [registry.py:330]     mod = importlib.import_module(self.module_name)
ERROR 04-21 18:46:40 [registry.py:330]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
ERROR 04-21 18:46:40 [registry.py:330]     return _bootstrap._gcd_import(name[level:], package, level)
ERROR 04-21 18:46:40 [registry.py:330]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap_external>", line 999, in exec_module
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 32, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     from vllm.attention import Attention, AttentionType
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/__init__.py", line 8, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     from vllm.attention.layer import Attention
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 14, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     from vllm.model_executor.layers.linear import UnquantizedLinearMethod
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 24, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     from vllm.model_executor.layers.tuned_gemm import tgemm
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/tuned_gemm.py", line 199, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     tgemm = TunedGemm()
ERROR 04-21 18:46:40 [registry.py:330]             ^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/tuned_gemm.py", line 44, in __init__
ERROR 04-21 18:46:40 [registry.py:330]     self.cu_count = torch.cuda.get_device_properties(
ERROR 04-21 18:46:40 [registry.py:330]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 580, in get_device_properties
ERROR 04-21 18:46:40 [registry.py:330]     _lazy_init()  # will define _get_device_properties
ERROR 04-21 18:46:40 [registry.py:330]     ^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 376, in _lazy_init
ERROR 04-21 18:46:40 [registry.py:330]     torch._C._cuda_init()
ERROR 04-21 18:46:40 [registry.py:330] RuntimeError: No HIP GPUs are available
ERROR 04-21 18:46:40 [registry.py:330]
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 33, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 947, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 220, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1204, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1130, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 392, in __init__
    self.multimodal_config = self._init_multimodal_config(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 460, in _init_multimodal_config
    if self.registry.is_multimodal_model(self.architectures):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 478, in is_multimodal_model
    model_cls, _ = self.inspect_model_cls(architectures)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 438, in inspect_model_cls
    return self._raise_for_unsupported(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 390, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
ERROR 04-21 18:46:40 [registry.py:330] Error in inspecting model architecture 'Qwen2ForCausalLM'
ERROR 04-21 18:46:40 [registry.py:330] Traceback (most recent call last):
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 553, in _run_in_subprocess
ERROR 04-21 18:46:40 [registry.py:330]     returned.check_returncode()
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/lib/python3.12/subprocess.py", line 504, in check_returncode
ERROR 04-21 18:46:40 [registry.py:330]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 04-21 18:46:40 [registry.py:330] subprocess.CalledProcessError: Command '['/usr/bin/python3.12', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 04-21 18:46:40 [registry.py:330]
ERROR 04-21 18:46:40 [registry.py:330] The above exception was the direct cause of the following exception:
ERROR 04-21 18:46:40 [registry.py:330]
ERROR 04-21 18:46:40 [registry.py:330] Traceback (most recent call last):
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 328, in _try_inspect_model_cls
ERROR 04-21 18:46:40 [registry.py:330]     return model.inspect_model_cls()
ERROR 04-21 18:46:40 [registry.py:330]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 299, in inspect_model_cls
ERROR 04-21 18:46:40 [registry.py:330]     return _run_in_subprocess(
ERROR 04-21 18:46:40 [registry.py:330]            ^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 556, in _run_in_subprocess
ERROR 04-21 18:46:40 [registry.py:330]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 04-21 18:46:40 [registry.py:330] RuntimeError: Error raised in subprocess:
ERROR 04-21 18:46:40 [registry.py:330] <frozen runpy>:128: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 04-21 18:46:40 [registry.py:330] Traceback (most recent call last):
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen runpy>", line 198, in _run_module_as_main
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen runpy>", line 88, in _run_code
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 577, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     _run()
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 570, in _run
ERROR 04-21 18:46:40 [registry.py:330]     result = fn()
ERROR 04-21 18:46:40 [registry.py:330]              ^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 300, in <lambda>
ERROR 04-21 18:46:40 [registry.py:330]     lambda: _ModelInfo.from_model_cls(self.load_model_cls()))
ERROR 04-21 18:46:40 [registry.py:330]                                       ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 303, in load_model_cls
ERROR 04-21 18:46:40 [registry.py:330]     mod = importlib.import_module(self.module_name)
ERROR 04-21 18:46:40 [registry.py:330]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
ERROR 04-21 18:46:40 [registry.py:330]     return _bootstrap._gcd_import(name[level:], package, level)
ERROR 04-21 18:46:40 [registry.py:330]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap_external>", line 999, in exec_module
ERROR 04-21 18:46:40 [registry.py:330]   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 32, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     from vllm.attention import Attention, AttentionType
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/__init__.py", line 8, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     from vllm.attention.layer import Attention
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 14, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     from vllm.model_executor.layers.linear import UnquantizedLinearMethod
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 24, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     from vllm.model_executor.layers.tuned_gemm import tgemm
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/tuned_gemm.py", line 199, in <module>
ERROR 04-21 18:46:40 [registry.py:330]     tgemm = TunedGemm()
ERROR 04-21 18:46:40 [registry.py:330]             ^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/tuned_gemm.py", line 44, in __init__
ERROR 04-21 18:46:40 [registry.py:330]     self.cu_count = torch.cuda.get_device_properties(
ERROR 04-21 18:46:40 [registry.py:330]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 580, in get_device_properties
ERROR 04-21 18:46:40 [registry.py:330]     _lazy_init()  # will define _get_device_properties
ERROR 04-21 18:46:40 [registry.py:330]     ^^^^^^^^^^^^
ERROR 04-21 18:46:40 [registry.py:330]   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 376, in _lazy_init
ERROR 04-21 18:46:40 [registry.py:330]     torch._C._cuda_init()
ERROR 04-21 18:46:40 [registry.py:330] RuntimeError: No HIP GPUs are available
ERROR 04-21 18:46:40 [registry.py:330]
ERROR 04-21 18:46:40 [engine.py:411] Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
ERROR 04-21 18:46:40 [engine.py:411] Traceback (most recent call last):
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
ERROR 04-21 18:46:40 [engine.py:411]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 04-21 18:46:40 [engine.py:411]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
ERROR 04-21 18:46:40 [engine.py:411]     engine_config = engine_args.create_engine_config(usage_context)
ERROR 04-21 18:46:40 [engine.py:411]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1204, in create_engine_config
ERROR 04-21 18:46:40 [engine.py:411]     model_config = self.create_model_config()
ERROR 04-21 18:46:40 [engine.py:411]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1130, in create_model_config
ERROR 04-21 18:46:40 [engine.py:411]     return ModelConfig(
ERROR 04-21 18:46:40 [engine.py:411]            ^^^^^^^^^^^^
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 392, in __init__
ERROR 04-21 18:46:40 [engine.py:411]     self.multimodal_config = self._init_multimodal_config(
ERROR 04-21 18:46:40 [engine.py:411]                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 460, in _init_multimodal_config
ERROR 04-21 18:46:40 [engine.py:411]     if self.registry.is_multimodal_model(self.architectures):
ERROR 04-21 18:46:40 [engine.py:411]        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 478, in is_multimodal_model
ERROR 04-21 18:46:40 [engine.py:411]     model_cls, _ = self.inspect_model_cls(architectures)
ERROR 04-21 18:46:40 [engine.py:411]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 438, in inspect_model_cls
ERROR 04-21 18:46:40 [engine.py:411]     return self._raise_for_unsupported(architectures)
ERROR 04-21 18:46:40 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:46:40 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 390, in _raise_for_unsupported
ERROR 04-21 18:46:40 [engine.py:411]     raise ValueError(
ERROR 04-21 18:46:40 [engine.py:411] ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 413, in run_mp_engine
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1204, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1130, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 392, in __init__
    self.multimodal_config = self._init_multimodal_config(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 460, in _init_multimodal_config
    if self.registry.is_multimodal_model(self.architectures):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 478, in is_multimodal_model
    model_cls, _ = self.inspect_model_cls(architectures)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 438, in inspect_model_cls
    return self._raise_for_unsupported(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 390, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.

The ValueError about ‘Qwen2ForCausalLM’ failing to be inspected is a downstream symptom. The traceback shows the real failure: importing the model module runs TunedGemm(), which calls torch.cuda.get_device_properties, and that raises “RuntimeError: No HIP GPUs are available”. In other words, the ROCm build of PyTorch inside the container cannot see any AMD GPU, so the problem most likely lies in the ROCm setup or in device visibility rather than in the model itself.

  1. Check ROCm Installation: Ensure that ROCm is correctly installed and configured on your system. Verify that the ROCm drivers are up-to-date and compatible with your hardware.

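  2. Environment Variables: Make sure HIP_VISIBLE_DEVICES, if set, points at a GPU that HIP can actually see, for example export HIP_VISIBLE_DEVICES=0 inside the Docker container. HIP numbers only the AMD GPUs, so on a machine with one NVIDIA and one AMD card the AMD card should normally be HIP device 0, even if system tools list it second. An index that does not exist hides every GPU and can produce this same “No HIP GPUs are available” error.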

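  3. Docker Configuration: Ensure that the Docker container can access the AMD GPU. For ROCm this means passing the device nodes with --device=/dev/kfd and --device=/dev/dri (as in the docker run command earlier in this thread) and making sure the container user belongs to the render (and, on some distributions, video) group. The --gpus flag is handled by the NVIDIA Container Toolkit and does not expose AMD GPUs.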

  4. Model Compatibility: ‘Qwen2ForCausalLM’ is an architecture that vLLM supports, so the inspection failure is most likely a side effect of the missing GPU rather than a model problem. Re-check model support in the vLLM documentation only if the error persists once the GPU is visible.
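
Checks 1 to 3 above can be roughly sketched as a small script to run inside the container. This is only a diagnostic sketch, not part of vLLM: the function name rocm_visibility_report and the layout of the returned dictionary are made up for illustration. It assumes the usual ROCm device nodes (/dev/kfd and /dev/dri) and reports the PyTorch view only when PyTorch can be imported.

```python
import os

# Device nodes the ROCm runtime needs inside the container
# (the same ones passed with --device in the docker run command).
ROCM_DEVICE_NODES = ("/dev/kfd", "/dev/dri")

# Environment variables that can hide GPUs from HIP / PyTorch.
VISIBILITY_VARS = (
    "HIP_VISIBLE_DEVICES",
    "ROCR_VISIBLE_DEVICES",
    "CUDA_VISIBLE_DEVICES",
)


def rocm_visibility_report(env=None, exists=os.path.exists):
    """Collect basic facts about whether HIP can plausibly see a GPU.

    `env` and `exists` are injectable so the function can be tested
    without touching the real environment (hypothetical helper).
    """
    env = os.environ if env is None else env
    report = {
        # True/False per device node: False means Docker did not pass it in.
        "device_nodes": {node: exists(node) for node in ROCM_DEVICE_NODES},
        # None means the variable is unset, i.e. no filtering is applied.
        "visibility_vars": {name: env.get(name) for name in VISIBILITY_VARS},
    }
    try:
        import torch  # optional: only present where PyTorch is installed
    except ImportError:
        report["torch"] = None
    else:
        report["torch"] = {
            "hip_version": torch.version.hip,  # None on CUDA-only builds
            "is_available": torch.cuda.is_available(),
            "device_count": torch.cuda.device_count(),
        }
    return report


if __name__ == "__main__":
    for key, value in rocm_visibility_report().items():
        print(f"{key}: {value}")
```

If a device node shows False, the container was started without the matching --device option. If the nodes are present but device_count is 0, the visibility variables are the next thing to look at.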

Would you like more detailed guidance on any of these steps?

Setting the env variable in my OS instead of in the Docker container gave me this (the behavior was different when launching the model):

INFO 04-21 18:53:38 [api_server.py:209] Started engine process with PID 44
config.json: 100%|█████████████████████████████████████████████████████| 679/679 [00:00<00:00, 8.48MB/s]
INFO 04-21 18:53:40 [__init__.py:256] Automatically detected platform rocm.
INFO 04-21 18:53:49 [config.py:578] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 04-21 18:53:50 [config.py:578] This model supports multiple tasks: {'classify', 'generate', 'reward', 'embed', 'score'}. Defaulting to 'generate'.
INFO 04-21 18:53:51 [config.py:1508] Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.
INFO 04-21 18:53:51 [config.py:1520] Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.
WARNING 04-21 18:53:51 [arg_utils.py:1282] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 04-21 18:53:51 [rocm.py:228] Aiter main switch (VLLM_USE_AITER) is not set. Disabling individual Aiter components
tokenizer_config.json: 100%|███████████████████████████████████████| 3.07k/3.07k [00:00<00:00, 40.6MB/s]
INFO 04-21 18:53:52 [config.py:1508] Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.
INFO 04-21 18:53:52 [config.py:1520] Disabled the custom all-reduce kernel because it is not working correctly when using two AMD Navi GPUs.
WARNING 04-21 18:53:52 [arg_utils.py:1282] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 04-21 18:53:52 [rocm.py:228] Aiter main switch (VLLM_USE_AITER) is not set. Disabling individual Aiter components
INFO 04-21 18:53:52 [engine.py:77] Initializing a V0 LLM engine (v0.7.4.dev388+g51641aaa7) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
tokenizer.json: 100%|██████████████████████████████████████████████| 7.03M/7.03M [00:01<00:00, 4.40MB/s]
generation_config.json: 100%|███████████████████████████████████████████| 181/181 [00:00<00:00, 615kB/s]
INFO 04-21 18:53:56 [rocm.py:133] None is not supported in AMD GPUs.
INFO 04-21 18:53:56 [rocm.py:134] Using ROCmFlashAttention backend.
[rank0]:[W421 18:53:56.974263691 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-21 18:53:57 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-21 18:53:57 [model_runner.py:1115] Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
WARNING 04-21 18:53:57 [rocm.py:239] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
ERROR 04-21 18:53:57 [engine.py:411] HIP error: invalid device function
ERROR 04-21 18:53:57 [engine.py:411] HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-21 18:53:57 [engine.py:411] For debugging consider passing AMD_SERIALIZE_KERNEL=3
ERROR 04-21 18:53:57 [engine.py:411] Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
ERROR 04-21 18:53:57 [engine.py:411] Traceback (most recent call last):
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
ERROR 04-21 18:53:57 [engine.py:411]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 04-21 18:53:57 [engine.py:411]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 125, in from_engine_args
ERROR 04-21 18:53:57 [engine.py:411]     return cls(ipc_path=ipc_path,
ERROR 04-21 18:53:57 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 77, in __init__
ERROR 04-21 18:53:57 [engine.py:411]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-21 18:53:57 [engine.py:411]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "vllm/engine/llm_engine.py", line 274, in vllm.engine.llm_engine.LLMEngine.__init__
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-21 18:53:57 [engine.py:411]     self._init_executor()
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 04-21 18:53:57 [engine.py:411]     self.collective_rpc("load_model")
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-21 18:53:57 [engine.py:411]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-21 18:53:57 [engine.py:411]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2444, in run_method
ERROR 04-21 18:53:57 [engine.py:411]     return func(*args, **kwargs)
ERROR 04-21 18:53:57 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 211, in load_model
ERROR 04-21 18:53:57 [engine.py:411]     self.model_runner.load_model()
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1118, in load_model
ERROR 04-21 18:53:57 [engine.py:411]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-21 18:53:57 [engine.py:411]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-21 18:53:57 [engine.py:411]     return loader.load_model(vllm_config=vllm_config)
ERROR 04-21 18:53:57 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 423, in load_model
ERROR 04-21 18:53:57 [engine.py:411]     model = _initialize_model(vllm_config=vllm_config)
ERROR 04-21 18:53:57 [engine.py:411]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
ERROR 04-21 18:53:57 [engine.py:411]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-21 18:53:57 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 431, in __init__
ERROR 04-21 18:53:57 [engine.py:411]     self.model = Qwen2Model(vllm_config=vllm_config,
ERROR 04-21 18:53:57 [engine.py:411]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-21 18:53:57 [engine.py:411]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 300, in __init__
ERROR 04-21 18:53:57 [engine.py:411]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-21 18:53:57 [engine.py:411]                                                     ^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 558, in make_layers
ERROR 04-21 18:53:57 [engine.py:411]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-21 18:53:57 [engine.py:411]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
ERROR 04-21 18:53:57 [engine.py:411]     lambda prefix: Qwen2DecoderLayer(config=config,
ERROR 04-21 18:53:57 [engine.py:411]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in __init__
ERROR 04-21 18:53:57 [engine.py:411]     self.self_attn = Qwen2Attention(
ERROR 04-21 18:53:57 [engine.py:411]                      ^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in __init__
ERROR 04-21 18:53:57 [engine.py:411]     self.rotary_emb = get_rope(
ERROR 04-21 18:53:57 [engine.py:411]                       ^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 1111, in get_rope
ERROR 04-21 18:53:57 [engine.py:411]     rotary_emb = RotaryEmbedding(head_size, rotary_dim, max_position, base,
ERROR 04-21 18:53:57 [engine.py:411]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 99, in __init__
ERROR 04-21 18:53:57 [engine.py:411]     cache = self._compute_cos_sin_cache()
ERROR 04-21 18:53:57 [engine.py:411]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 116, in _compute_cos_sin_cache
ERROR 04-21 18:53:57 [engine.py:411]     inv_freq = self._compute_inv_freq(self.base)
ERROR 04-21 18:53:57 [engine.py:411]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 110, in _compute_inv_freq
ERROR 04-21 18:53:57 [engine.py:411]     inv_freq = 1.0 / (base**(torch.arange(
ERROR 04-21 18:53:57 [engine.py:411]                              ^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
ERROR 04-21 18:53:57 [engine.py:411]     return func(*args, **kwargs)
ERROR 04-21 18:53:57 [engine.py:411]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 18:53:57 [engine.py:411] RuntimeError: HIP error: invalid device function
ERROR 04-21 18:53:57 [engine.py:411] HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-21 18:53:57 [engine.py:411] For debugging consider passing AMD_SERIALIZE_KERNEL=3
ERROR 04-21 18:53:57 [engine.py:411] Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
ERROR 04-21 18:53:57 [engine.py:411]
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 413, in run_mp_engine
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 125, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 77, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/engine/llm_engine.py", line 274, in vllm.engine.llm_engine.LLMEngine.__init__
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
    self.collective_rpc("load_model")
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2444, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 211, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1118, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 423, in load_model
    model = _initialize_model(vllm_config=vllm_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 431, in __init__
    self.model = Qwen2Model(vllm_config=vllm_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 300, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
                                                    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 558, in make_layers
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
    lambda prefix: Qwen2DecoderLayer(config=config,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in __init__
    self.self_attn = Qwen2Attention(
                     ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in __init__
    self.rotary_emb = get_rope(
                      ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 1111, in get_rope
    rotary_emb = RotaryEmbedding(head_size, rotary_dim, max_position, base,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 99, in __init__
    cache = self._compute_cos_sin_cache()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 116, in _compute_cos_sin_cache
    inv_freq = self._compute_inv_freq(self.base)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 110, in _compute_inv_freq
    inv_freq = 1.0 / (base**(torch.arange(
                             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

[rank0]:[W421 18:53:57.714810830 ProcessGroupNCCL.cpp:1505] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 33, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 947, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 233, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Setting the env variable in both the host OS and the container gave mostly the same output, but it failed earlier than the previous run. The last error with the env variable set in both the OS and the container is this:

ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.

It seems that ROCm gets detected, but for some reason it keeps failing to launch the model.

If you know of anything that could help troubleshoot this, I’d really appreciate it, since I don’t know what to try next.

Another possibility is to run my AMD GPU with ZLUDA so that it can use CUDA, but I’m not sure whether that would help with launching vLLM.

EDIT: This is supposed to be a continuation of my last message.

The error messages indicate issues with the ROCm setup and model compatibility. Here are some steps to troubleshoot and potentially resolve the issues:

  1. Verify ROCm Installation: Ensure that ROCm is correctly installed and configured. Check that the ROCm drivers are up-to-date and compatible with your hardware. You can refer to the ROCm installation guide for detailed instructions.

  2. Environment Variables: Ensure that the HIP_VISIBLE_DEVICES environment variable is set correctly both in the host OS and inside the Docker container. This variable should specify the GPU(s) you want to use, e.g., export HIP_VISIBLE_DEVICES=0.

  3. Model Compatibility: Verify that the model ‘Qwen2ForCausalLM’ is supported by vLLM. Check the vLLM documentation for any specific requirements or compatibility notes.

  4. Flash Attention: The error message suggests that Triton flash attention does not support sliding window attention (SWA) in half-precision. You can try using CK flash attention by setting export VLLM_USE_TRITON_FLASH_ATTN=0.

  5. Debugging HIP Errors: The error “HIP error: invalid device function” usually indicates that a kernel was not compiled for the architecture of the GPU it was launched on. As the log itself suggests, you can get more detailed debugging output by setting export AMD_SERIALIZE_KERNEL=3, and device-side assertions by using a PyTorch build compiled with TORCH_USE_HIP_DSA.

  6. Alternative Approaches: If the above steps do not resolve the issue, you might consider using ZLUDA to run CUDA on AMD GPUs, but this is not guaranteed to work with vLLM and may introduce additional complexity.
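
As a rough sketch of steps 2, 4 and 5, the Python helpers below build an environment for a vLLM subprocess and report which GPUs the installed PyTorch build can see. The function names `build_rocm_env` and `describe_gpus` are hypothetical, and the default index 0 assumes the AMD card is the first device from HIP's point of view (HIP is assumed to enumerate only AMD devices, so the mixed nvidia == 0, amd == 1 numbering would not apply). Treat it as a starting point, not a verified fix.

```python
import os


def build_rocm_env(amd_index=0, base_env=None):
    """Return a copy of the environment with the variables from steps 2, 4 and 5 set."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: this index is the AMD card as numbered by HIP, not by the mixed host order.
    env["HIP_VISIBLE_DEVICES"] = str(amd_index)
    # Step 4: use CK flash attention instead of Triton flash attention.
    env["VLLM_USE_TRITON_FLASH_ATTN"] = "0"
    # Step 5: the setting suggested by the HIP error message itself.
    env["AMD_SERIALIZE_KERNEL"] = "3"
    return env


def describe_gpus():
    """Report what the installed PyTorch build sees; empty report if torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "hip": None, "compiled_archs": [], "devices": []}
    report = {
        "torch": torch.__version__,
        "hip": torch.version.hip,  # None on a CUDA-only build
        "compiled_archs": [],
        "devices": [],
    }
    if torch.cuda.is_available():
        # Architectures the PyTorch binary was compiled for.
        report["compiled_archs"] = torch.cuda.get_arch_list()
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            report["devices"].append({
                "index": i,
                "name": props.name,
                # gcnArchName is only present on ROCm builds, hence getattr.
                "arch": getattr(props, "gcnArchName", None),
            })
    return report
```

One way to use this would be to print `describe_gpus()` inside the container first, then launch with `subprocess.run(["vllm", "serve", "<yourmodel>"], env=build_rocm_env())`. If the device's `arch` does not appear among `compiled_archs`, that would be consistent with the “invalid device function” error, though other causes are possible.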

Would you like more detailed guidance on any of these steps?
