vLLM does not natively support running a single model across heterogeneous GPUs (different vendors or architectures, e.g., mixing Hygon DCUs and NVIDIA GPUs) in one inference instance. Its distributed inference (tensor/pipeline parallelism) assumes that all participating GPUs are of the same type and architecture, both for correctness and for performance, per the official documentation and parallelism guidelines.
You can, however, run a separate vLLM instance per GPU type (e.g., one instance on the Hygon DCUs, another on the NVIDIA GPUs) and manage them independently; what you cannot do is shard a single model across both Hygon and NVIDIA GPUs in one vLLM process. Within each instance, use only identical GPUs for multi-GPU parallelism, and control which GPUs that instance sees via the CUDA_VISIBLE_DEVICES environment variable for NVIDIA cards (or the equivalent for Hygon DCUs), as described in this issue. Would you like more detail on how to set up vLLM for each GPU type separately?
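As a sketch of the separate-instance approach: each vLLM server is launched with only its own GPUs visible, and a thin client-side router picks the right endpoint per request. The ports, model path, and the `endpoint_for` helper below are illustrative assumptions for this answer, not part of vLLM's API; both instances expose vLLM's OpenAI-compatible HTTP server, so clients can talk to either one the same way.

```python
# Hypothetical launch commands (ports and model path are assumptions):
#   NVIDIA instance:
#     CUDA_VISIBLE_DEVICES=0,1 vllm serve my-model --tensor-parallel-size 2 --port 8000
#   Hygon DCU instance (using the DCU stack's equivalent device-selection variable):
#     <device-selection-env>=0,1 vllm serve my-model --tensor-parallel-size 2 --port 8001
#
# A minimal client-side router that maps a backend name to the base URL of
# the vLLM instance serving it.

BACKENDS = {
    "nvidia": "http://localhost:8000/v1",  # assumed port for the NVIDIA instance
    "hygon": "http://localhost:8001/v1",   # assumed port for the Hygon DCU instance
}


def endpoint_for(backend: str) -> str:
    """Return the OpenAI-compatible base URL for the given backend name."""
    try:
        return BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}")
```

A request would then be sent through any OpenAI-compatible client pointed at the returned base URL; load balancing or failover between the two instances would likewise live in this client-side layer, since the instances share no state.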