The issue is caused by CUDA device ordering. By default, CUDA_DEVICE_ORDER is FASTEST_FIRST, so device 0 is assigned to the GPU CUDA considers fastest (roughly the one with the highest compute capability), not the one with the lowest PCI bus ID. nvidia-smi, however, lists devices in PCI bus order, so the indices you pass to CUDA_VISIBLE_DEVICES may not map to the GPUs you intend, and vLLM can end up running on the 2080 Ti and failing with errors about features that card does not support. The recommended workaround is to set both CUDA_DEVICE_ORDER=PCI_BUS_ID and CUDA_VISIBLE_DEVICES before launching vLLM. This makes CUDA's device enumeration match nvidia-smi's, so the 3090s are the GPUs vLLM actually uses for inference. VM GPU passthrough is not required for this workaround to work.
Example:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2 # replace with your 3090 indices as shown by nvidia-smi (PCI bus order)
python -m vllm.entrypoints.openai.api_server ...
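To confirm the mapping before starting the server, you can run a quick sanity check with PyTorch in the same environment (a sketch; the indices above are placeholders for your setup). With both variables exported as above, this should list only the 3090s:

python -c "import torch; [print(i, torch.cuda.get_device_name(i)) for i in range(torch.cuda.device_count())]"

If a 2080 Ti still shows up here, the indices in CUDA_VISIBLE_DEVICES don't match the PCI-bus-order numbering and need to be adjusted.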