@CarlosR759
Based on my understanding, with vLLM, the speculative decoding (draft) model needs to share the same GPUs as the base model. There isn't a configuration that places the draft model on one GPU and the base model on another.
Concretely, the following is possible:
- Base model with TP=2, draft model sharing those GPUs:
| GPU 0 | GPU 1 |
|---|---|
| Base Model (first half) | Base Model (second half) |
| Draft Model | |
- The following is not possible yet:
| GPU 0 | GPU 1 |
|---|---|
| Draft Model | Base Model |
Moreover, the draft model currently has to run without tensor parallelism, meaning `draft_tensor_parallel_size` must be set to 1.
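As a rough sketch, the supported layout above would be configured along these lines. The model names are placeholders, and the exact argument names (`speculative_model`, `num_speculative_tokens`, `speculative_draft_tensor_parallel_size`) vary across vLLM versions, so please check the engine-arguments reference for the release you're running:

```python
from vllm import LLM, SamplingParams

# Base model sharded TP=2 across GPU 0 and GPU 1; the draft model is
# co-located on the same GPUs and must not itself be sharded.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",             # placeholder base model
    tensor_parallel_size=2,                        # base model split over 2 GPUs
    speculative_model="meta-llama/Llama-2-7b-hf",  # placeholder draft model
    num_speculative_tokens=5,                      # tokens proposed per step
    speculative_draft_tensor_parallel_size=1,      # draft model runs without TP
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
```

There is no argument that pins the draft model to a different GPU than the base model, which is exactly the limitation described above.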