vLLM + Ray: exclude/include specific GPUs on a running cluster?

Ray cluster already up (2 nodes × 2 GPUs each). Launching vLLM with --distributed-executor-backend=ray and --tensor-parallel-size 2. I need vLLM to use 1 specific GPU on node A and 1 specific GPU on node B (leave the other GPUs free) without restarting Ray.

Is there a way to instruct vLLM not to use specific GPUs from the cluster? If not, what would be an alternative way?

No, vLLM does not provide a built-in way to select specific GPU IDs across nodes in a Ray cluster. The only supported mechanism for restricting GPU usage is the CUDA_VISIBLE_DEVICES environment variable, and it must be set before the relevant process starts; it applies per process and per node, not across an already running Ray cluster.

There is no vLLM or Ray API or parameter for picking arbitrary GPU IDs on different nodes at runtime: vLLM uses whatever GPUs Ray allocates to its workers, and Ray's scheduler does not support fine-grained GPU ID selection after the cluster is started. This limitation is discussed in several issues, e.g. vllm#4265, vllm#4981, and vllm#554.

The alternatives are:

1. Set CUDA_VISIBLE_DEVICES on each node before launching the process that will host vLLM's workers, so only the desired GPU is visible on that node.
2. Use Ray placement groups to control which nodes are used and how many GPUs are allocated per node, but not which specific GPU IDs.

If you need to leave some GPUs free, those GPUs must not be visible to the relevant processes at start time. There is no way to change GPU visibility or allocation for a running Ray cluster without restarting the relevant processes or the cluster itself.
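As a concrete sketch of option (1), assuming node hostnames, GPU indices, and the model name shown here (none of these come from the thread), each worker node rejoins the cluster with only one GPU visible. Note that this does require restarting the Ray worker process on each node, consistent with the limitation above:

```shell
# On node A: rejoin the cluster with only GPU index 0 visible.
# (Head address and GPU index are assumptions for illustration.)
ray stop
CUDA_VISIBLE_DEVICES=0 ray start --address=<head-ip>:6379

# On node B: same idea, exposing only GPU index 1.
ray stop
CUDA_VISIBLE_DEVICES=1 ray start --address=<head-ip>:6379

# Then launch vLLM as before; Ray now advertises one GPU per node,
# so tensor parallelism of 2 uses exactly those two GPUs.
vllm serve <model> --distributed-executor-backend=ray --tensor-parallel-size 2
```

The other two GPUs remain free for unrelated processes, since the Ray workers never see them.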

Would you like more detail on how to use placement groups or CUDA_VISIBLE_DEVICES for this purpose?

Sources:

If you are using Docker or Kubernetes, you can use the NVIDIA_VISIBLE_DEVICES environment variable.

            - name: NVIDIA_VISIBLE_DEVICES
              value: "1" # expose only GPU index 1 to this container

or

Before you run ray start on a node, first run export NVIDIA_VISIBLE_DEVICES=1 on that node.

Yes, in Docker or Kubernetes, setting the NVIDIA_VISIBLE_DEVICES environment variable (e.g., value: "1") restricts the container to only see and use GPU 1. This works at the container-runtime level (via the NVIDIA container runtime) and has the same practical effect as CUDA_VISIBLE_DEVICES; it is the recommended way to control GPU visibility per containerized process, including for vLLM deployments in these environments. This approach is confirmed in vllm#1846.
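For plain Docker, the same restriction can be sketched as follows (the image name, model placeholder, and GPU index are assumptions, not from the thread):

```shell
# Expose only GPU index 1 to the container via the NVIDIA container runtime.
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
    vllm/vllm-openai:latest --model <model> --tensor-parallel-size 1
```

With a current Docker and the NVIDIA Container Toolkit, `docker run --gpus '"device=1"'` achieves the same per-container GPU selection.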

Would you like more detail on how to configure this in your deployment YAML or Docker command?
