Deploying multi-node LLM with InfiniBand/RoCE

We have multiple nodes with 8 GPUs each, plus RDMA-capable NICs and switches. In Kubernetes, if I want to deploy an LLM using GPUs from multiple nodes, do I need to set any argument or environment variable, apart from the pipeline and tensor parallelism settings, to make sure that RDMA is enabled? I assume the Kubernetes deployment is done via the RayService CRD, but with what RDMA parameters?

To enable RDMA for vLLM distributed inference on Kubernetes, you must set specific NCCL environment variables in your pod/container spec, such as NCCL_IB_HCA=mlx5 (or your device name), and ensure NCCL_IB_DISABLE=0. Also add the IPC_LOCK capability and mount a memory-backed emptyDir at /dev/shm for shared memory. These settings are required in addition to configuring tensor and pipeline parallelism. For the RayService CRD, set these environment variables in the container spec of your Ray head and worker pods. There is no vLLM-specific RDMA flag; all RDMA enablement is via NCCL and container configuration. To verify RDMA is used, set NCCL_DEBUG=TRACE and check the logs for [send] via NET/IB/GDRDMA (docs, example YAML).

Example Kubernetes YAML snippet:

env:
  - name: NCCL_IB_HCA
    value: "mlx5_0"
  - name: NCCL_IB_DISABLE
    value: "0"
  - name: NCCL_DEBUG
    value: "TRACE"
securityContext:
  capabilities:
    add: ["IPC_LOCK"]
volumeMounts:
  - mountPath: /dev/shm
    name: dshm
volumes:  # pod-level definition backing the /dev/shm mount
  - name: dshm
    emptyDir:
      medium: Memory
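
Once the pods are running, a quick way to confirm that the IB path is actually used (the pod name below is a placeholder) is to grep a worker's logs for the NCCL transport line mentioned above; if NCCL has fallen back to TCP you will typically see NET/Socket entries instead:

kubectl logs <ray-worker-pod> | grep -E "NET/IB|GDRDMA"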

No additional vLLM CLI arguments are needed for RDMA; all configuration is through environment variables and pod spec (docs, KubeRay integration). Would you like a full example YAML or more details on RayService CRD integration?

I assume the Nvidia/Mellanox Network Operator is needed to enable InfiniBand or RoCE. In that case, will the vLLM deployment in Kubernetes use the secondary interface (e.g., MACVLAN or SR-IOV)? Will RayService be applicable?

Yes, to enable InfiniBand or RoCE in Kubernetes, you typically need the Nvidia/Mellanox network operator to provision SR-IOV or MACVLAN secondary interfaces for RDMA (KubeRay integration). vLLM and Ray can use these secondary interfaces if you set the correct NCCL environment variables (e.g., NCCL_SOCKET_IFNAME to the RDMA interface name, such as ib0 or the SR-IOV device) in your pod spec. RayService is fully applicable; just ensure the environment variables are set in both head and worker pods so NCCL uses the RDMA-enabled interface (example, distributed serving docs).

You must explicitly set NCCL_SOCKET_IFNAME to the RDMA interface name to avoid NCCL defaulting to the wrong interface, especially in multi-homed pods. RayService will orchestrate the cluster as usual; the key is correct pod-level network and environment configuration. Would you like a sample RayService CRD or pod spec for this setup?

Yes, please provide a RayService example that addresses this, both for SR-IOV and when using a dedicated NIC per GPU. What is ib0?

It seems likely that for SR-IOV or a dedicated NIC per GPU, you should set the NCCL_SOCKET_IFNAME environment variable to the correct network interface name (e.g., ib0 for InfiniBand, or the SR-IOV-assigned interface name) in your RayService pod spec. ib0 is the typical default name for the first InfiniBand interface, but with SR-IOV or a custom CNI your interface may be named differently (e.g., ens2f0, or a MACVLAN/SR-IOV interface such as net1). You can find the correct name by running ip a inside the pod.
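
For example, assuming the container image ships the usual iproute2 and OFED user-space tools (the pod name is a placeholder):

kubectl exec -it <ray-worker-pod> -- ip -br addr     # list interface names (eth0, ib0, net1, ...)
kubectl exec -it <ray-worker-pod> -- ibdev2netdev    # map RDMA devices (mlx5_X) to interface names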

Here is a RayService YAML snippet for KubeRay with the relevant environment variables for RDMA (InfiniBand or RoCE) and SR-IOV/dedicated NICs:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-rayservice
spec:
  rayClusterConfig:
    rayVersion: '2.9.0'
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: vllm/vllm-openai:latest
              env:
                - name: NCCL_IB_HCA
                  value: "mlx5_0"  # or your device
                - name: NCCL_IB_DISABLE
                  value: "0"
                - name: NCCL_SOCKET_IFNAME
                  value: "ib0"  # or your SR-IOV/MACVLAN device name
                - name: NCCL_DEBUG
                  value: "INFO"
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
    workerGroupSpecs:
      - groupName: worker-group
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: vllm/vllm-openai:latest
                env:
                  - name: NCCL_IB_HCA
                    value: "mlx5_0"
                  - name: NCCL_IB_DISABLE
                    value: "0"
                  - name: NCCL_SOCKET_IFNAME
                    value: "ib0"
                  - name: NCCL_DEBUG
                    value: "INFO"
                securityContext:
                  capabilities:
                    add: ["IPC_LOCK"]
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
            volumes:
              - name: dshm
                emptyDir:
                  medium: Memory

Replace "ib0" with your actual RDMA interface name as needed. RayService will work as long as the environment variables are set and the network operator provisions the correct interfaces (example, KubeRay docs). Would you like more details on interface discovery or SR-IOV setup?

This RayService CRD seems incorrect, because no GPU resources are allocated. From the Nvidia Network Operator, how can I find the values for NCCL_IB_HCA and NCCL_SOCKET_IFNAME? Could you link them to the NicClusterPolicy?

You are correct: the RayService CRD must request GPU resources (e.g., nvidia.com/gpu: 8) in the resources section of the pods that run vLLM workers (the head pod only needs GPUs if it also runs inference) (KubeRay docs). For RDMA, the values for NCCL_IB_HCA and NCCL_SOCKET_IFNAME depend on the network interfaces provisioned by the Nvidia Network Operator, which is configured via the NicClusterPolicy and, when SR-IOV is used, the accompanying SriovNetworkNodePolicy.

  • NCCL_IB_HCA: set this to the RDMA device name of your InfiniBand/RoCE NIC, such as mlx5_0, mlx5_1, etc. You can find these by running ibv_devices or ibdev2netdev inside a pod with RDMA access, or by matching the rootDevices (PCI addresses) listed in the nicSelector of your SriovNetworkNodePolicy.
  • NCCL_SOCKET_IFNAME: set this to the network interface name (e.g., ib0, ens2f0) that is mapped to the RDMA device. You can find it by running ip a or ibdev2netdev inside the pod, or by checking the pfNames field of the same nicSelector.

Example: If your SriovNetworkNodePolicy has

nicSelector:
  pfNames: ["ens2f0"]
  rootDevices: ["0000:3b:00.0"]

then NCCL_SOCKET_IFNAME=ens2f0 and NCCL_IB_HCA=mlx5_0 (if mlx5_0 maps to ens2f0).

You can confirm the mapping with:

ibdev2netdev

which outputs lines like:

mlx5_0 port 1 ==> ens2f0 (Up)

So, set:

  • NCCL_IB_HCA=mlx5_0
  • NCCL_SOCKET_IFNAME=ens2f0

For more, see Nvidia Network Operator documentation and KubeRay integration. Would you like a full RayService CRD example with GPU and RDMA resources?

In the Network Operator and NicClusterPolicy, I understand that RoCE is implemented over IP (RoCEv2) and that a secondary interface is needed for the pods. That could be a MACVLAN network. How will the vLLM deployment in Kubernetes be impacted by this? How will we be able to deploy the LLM across multiple nodes using RayService and utilize RDMA devices via secondary interfaces like MACVLAN?

When using the Nvidia Network Operator with a NicClusterPolicy to provision RoCE (RDMA over Converged Ethernet) via MACVLAN secondary interfaces, vLLM deployment in Kubernetes is impacted in these ways:

  1. Pod Networking: Each pod running vLLM (Ray head/worker) must be attached to the MACVLAN network that exposes the RDMA-capable interface. This is typically done using Multus CNI, which allows pods to have multiple network interfaces—one for regular pod traffic and one for RDMA/ROCE (Nvidia Network Operator docs).

  2. NCCL Environment Variables: In your RayService CRD, set NCCL_SOCKET_IFNAME to the MACVLAN interface name (e.g., net1, macvlan0, etc.) and NCCL_IB_HCA to the corresponding device (e.g., mlx5_0). These must match the interfaces provided by the Network Operator and Multus. You can discover the correct names by running ip a and ibdev2netdev inside the pod.

  3. RayService Deployment: RayService is fully compatible. In your RayService CRD, ensure:

    • GPU resources are requested (nvidia.com/gpu: N).
    • The pod spec includes the Multus network annotation to attach the MACVLAN.
    • The environment variables above are set in both head and worker pods.

Summary: vLLM will use RDMA over the MACVLAN interface as long as the pod is attached to the correct network and the NCCL environment variables point to the right interface. RayService will orchestrate the cluster as usual; the key is correct pod-level network and environment configuration (KubeRay integration, example with env vars). Would you like a YAML example for this setup?
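
For reference, here is a minimal sketch of such a MACVLAN NetworkAttachmentDefinition for one RDMA-capable NIC; the name, master interface, and IPAM settings are illustrative assumptions, and the Network Operator can generate an equivalent object from its MacvlanNetwork CRD. One such definition is needed per RDMA-capable NIC:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-rdma-net1   # hypothetical name, referenced from the pod's k8s.v1.cni.cncf.io/networks annotation
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens2f0",
      "mode": "bridge",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }

Inside the pod, Multus exposes the attachment as net1 (net2, net3, ... for further attachments), which is the name NCCL_SOCKET_IFNAME should point to.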

The head node will not manage GPUs; do we need to assign a secondary interface to it? If we have multiple RDMA interfaces assigned to the LLM (for example 4, because we have 4 GPUs per node), how will NCCL_SOCKET_IFNAME and NCCL_IB_HCA be impacted? Can you provide a RayService example for this deployment: RDMA with MACVLAN, 4 GPUs per node, and 4 RDMA interfaces?

If the Ray head node does not run LLM inference (i.e., does not manage GPUs), it does not need the RDMA/MACVLAN secondary interface; only the worker pods that run vLLM and use GPUs require it (KubeRay integration). For multiple RDMA interfaces (e.g., 4 per node for 4 GPUs), you can set NCCL_SOCKET_IFNAME and NCCL_IB_HCA to comma-separated lists of interface/device names, or rely on NCCL's prefix matching (e.g., NCCL_IB_HCA=mlx5 matches all mlx5_* devices), so NCCL can select the correct interface for each GPU (NCCL docs).

Example RayService CRD for 4 GPUs/node, 4 RDMA MACVLAN interfaces:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-rayservice
spec:
  rayClusterConfig:
    rayVersion: '2.9.0'
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata: {}  # no k8s.v1.cni.cncf.io/networks annotation: the head does not attach the MACVLAN
        spec:
          containers:
            - name: ray-head
              image: vllm/vllm-openai:latest
              resources:
                limits:
                  nvidia.com/gpu: 0
    workerGroupSpecs:
      - groupName: worker-group
        replicas: 2
        rayStartParams: {}
        template:
          metadata:
            annotations:
              k8s.v1.cni.cncf.io/networks: 'macvlan-rdma-net1,macvlan-rdma-net2,macvlan-rdma-net3,macvlan-rdma-net4'  # one NetworkAttachmentDefinition per RDMA NIC
          spec:
            containers:
              - name: ray-worker
                image: vllm/vllm-openai:latest
                resources:
                  limits:
                    nvidia.com/gpu: 4
                env:
                  - name: NCCL_IB_HCA
                    value: "mlx5_0,mlx5_1,mlx5_2,mlx5_3"  # Or use "mlx5_*"
                  - name: NCCL_SOCKET_IFNAME
                    value: "net1,net2,net3,net4"  # MACVLAN interfaces as seen in pod (adjust as needed)
                  - name: NCCL_IB_DISABLE
                    value: "0"
                  - name: NCCL_DEBUG
                    value: "INFO"
                securityContext:
                  capabilities:
                    add: ["IPC_LOCK"]
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
            volumes:
              - name: dshm
                emptyDir:
                  medium: Memory

  • Adjust NCCL_SOCKET_IFNAME to match the actual MACVLAN interface names inside the pod (e.g., net1,net2,net3,net4).
  • Adjust NCCL_IB_HCA to match the RDMA device names (mlx5_0,mlx5_1,mlx5_2,mlx5_3).
  • The head pod does not need the MACVLAN attachment or GPU resources.

This setup ensures each GPU can use its dedicated RDMA interface. Would you like more detail on interface discovery or Multus/NetworkAttachmentDefinition setup?
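
As a first check, you can confirm that all four MACVLAN attachments are present inside a worker pod (the pod name is a placeholder):

kubectl exec -it <ray-worker-pod> -- ip -br addr   # expect net1, net2, net3 and net4 in addition to eth0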
