How to set up an AMD GPU as the default in a dual-GPU stack?

@CarlosR759
Based on my understanding, with vLLM the speculative decoding (draft) model has to share the same GPU(s) as the base model. There is no configuration that places the draft model on one GPU and the base model on another.

Concretely, the following layout is possible:

  1. Base model with TP=2, draft model on one of the same GPUs:

     | GPU 0 | GPU 1 |
     | --- | --- |
     | Base model (first half) | Base model (second half) |
     | Draft model | |

  2. The following is not possible yet:

     | GPU 0 | GPU 1 |
     | --- | --- |
     | Draft model | Base model |

Moreover, the draft model in speculative decoding currently needs to run without tensor parallelism, meaning `draft_tensor_parallel_size` should be set to 1.
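As a sketch, the supported layout above could be launched like this (the model names are placeholders, and the speculative-decoding flag names match older vLLM releases, so check the version you are running):

```shell
# Base model sharded across both GPUs (TP=2); the draft model sits
# alongside it on the same GPUs and must run with draft TP=1.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --speculative-draft-tensor-parallel-size 1
```

Note there is no flag in this setup that pins the draft model to a GPU separate from the base model, which is the limitation described above.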