vLLM + Ray: exclude/include specific GPUs on a running cluster?

Ray cluster already up (2 nodes × 2 GPUs each). Launching vLLM with --distributed-executor-backend=ray and --tensor-parallel-size 2. I need vLLM to use 1 specific GPU on node A and 1 specific GPU on node B (leave the other GPUs free) without restarting Ray.

Is there a way to instruct vLLM not to use specific GPUs from the cluster? If not, what would be an alternative way?

No, vLLM does not provide a built-in way to select specific GPU IDs across nodes in a Ray cluster. The only supported mechanism for restricting GPU usage is the CUDA_VISIBLE_DEVICES environment variable, and it must be set before the relevant process starts; it applies per process and per node, not across an already running Ray cluster.

There is no vLLM or Ray API or parameter for picking arbitrary GPU IDs on different nodes at runtime: vLLM uses whatever GPUs Ray allocates to its workers, and Ray's scheduler does not support fine-grained GPU ID selection after the cluster is started. This limitation is discussed in several issues, e.g. vllm#4265, vllm#4981, and vllm#554.

The alternatives are:

1. Set CUDA_VISIBLE_DEVICES on each node before launching the process that will host vLLM's workers, so only the desired GPU is visible on that node.
2. Use Ray placement groups to control which nodes are used and how many GPUs are allocated per node, but not which specific GPU IDs.

If you need to leave some GPUs free, those GPUs must not be visible to the relevant processes at start time. There is no way to change GPU visibility or allocation for a running Ray cluster without restarting the relevant processes or the cluster itself.
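As a concrete sketch of option (1), assuming node hostnames, GPU indices, and the model name shown here (none of these come from the thread), each worker node rejoins the cluster with only one GPU visible. Note that this does require restarting the Ray worker process on each node, consistent with the limitation above:

```shell
# On node A: rejoin the cluster with only GPU index 0 visible.
# (Head address and GPU index are assumptions for illustration.)
ray stop
CUDA_VISIBLE_DEVICES=0 ray start --address=<head-ip>:6379

# On node B: same idea, exposing only GPU index 1.
ray stop
CUDA_VISIBLE_DEVICES=1 ray start --address=<head-ip>:6379

# Then launch vLLM as before; Ray now advertises one GPU per node,
# so tensor parallelism of 2 uses exactly those two GPUs.
vllm serve <model> --distributed-executor-backend=ray --tensor-parallel-size 2
```

The other two GPUs remain free for unrelated processes, since the Ray workers never see them.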

Would you like more detail on how to use placement groups or CUDA_VISIBLE_DEVICES for this purpose?

Sources:

If you are using Docker or Kubernetes, you can use the NVIDIA_VISIBLE_DEVICES environment variable.

            - name: NVIDIA_VISIBLE_DEVICES
              value: "1" # expose only GPU index 1 to this container

or

Before you run ray start on a node, first run export NVIDIA_VISIBLE_DEVICES=1 on that node.

Yes, in Docker or Kubernetes, setting the NVIDIA_VISIBLE_DEVICES environment variable (e.g., value: "1") restricts the container to only see and use GPU 1. This works at the container-runtime level (via the NVIDIA container runtime) and has the same practical effect as CUDA_VISIBLE_DEVICES; it is the recommended way to control GPU visibility per containerized process, including for vLLM deployments in these environments. This approach is confirmed in vllm#1846.
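For plain Docker, the same restriction can be sketched as follows (the image name, model placeholder, and GPU index are assumptions, not from the thread):

```shell
# Expose only GPU index 1 to the container via the NVIDIA container runtime.
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
    vllm/vllm-openai:latest --model <model> --tensor-parallel-size 1
```

With a current Docker and the NVIDIA Container Toolkit, `docker run --gpus '"device=1"'` achieves the same per-container GPU selection.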

Would you like more detail on how to configure this in your deployment YAML or Docker command?
