Deploying multi-node LLM with InfiniBand/RoCE

You are most likely correct: the RayService CRD must specify GPU resources (e.g., nvidia.com/gpu: 8) in the container resources section of both head and worker pods for GPUs to be allocated to them (KubeRay docs). For RDMA, the values for NCCL_IB_HCA and NCCL_SOCKET_IFNAME depend on the network interfaces provisioned by the NVIDIA Network Operator, which is configured via the NicClusterPolicy CRD.

  • NCCL_IB_HCA: Set this to the device name of your InfiniBand/RoCE NIC, such as mlx5_0 or mlx5_1. You can list the devices by running ibv_devices or ibdev2netdev inside a pod with RDMA access. If the NICs are selected through an SR-IOV policy (SriovNetworkNodePolicy, used alongside the NicClusterPolicy), its spec.nicSelector section (pfNames or rootDevices) shows which physical NICs are exposed.
  • NCCL_SOCKET_IFNAME: Set this to the network interface name (e.g., ib0 or ens2f0) that NCCL should use for its socket-based bootstrap traffic, typically the interface mapped to the RDMA device. You can find it by running ip a or ibdev2netdev inside the pod, or by checking the nicSelector pfNames field of the same policy. The name seen inside the pod may differ from the host-side name, so verify it in the pod.

Example: If your SriovNetworkNodePolicy has

nicSelector:
  pfNames: ["ens2f0"]
  rootDevices: ["0000:3b:00.0"]

then NCCL_SOCKET_IFNAME=ens2f0 and NCCL_IB_HCA=mlx5_0 (if mlx5_0 maps to ens2f0).
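
If you only know the PCI address (as in rootDevices), one way to check which RDMA device and interface sit behind it is to read sysfs on the node or in a privileged pod. The following is a minimal sketch, not part of any operator tooling: the function name pci_to_rdma and the SYSFS_ROOT override are hypothetical, and it assumes the usual Linux layout in which a PCI device directory contains infiniband/ and net/ subdirectories.

```shell
# Sketch: map a PCI address to its RDMA device and network interface via sysfs.
# SYSFS_ROOT is a hypothetical override (useful for testing); it defaults to /sys.
pci_to_rdma() {
    pci_addr="$1"
    dev_dir="${SYSFS_ROOT:-/sys}/bus/pci/devices/${pci_addr}"

    # Each subdirectory holds one entry per device exposed by this PCI function.
    hca=$(ls "${dev_dir}/infiniband" 2>/dev/null | head -n 1)
    netdev=$(ls "${dev_dir}/net" 2>/dev/null | head -n 1)

    if [ -z "$hca" ] || [ -z "$netdev" ]; then
        echo "no RDMA device or netdev found for ${pci_addr}" >&2
        return 1
    fi

    echo "NCCL_IB_HCA=${hca}"
    echo "NCCL_SOCKET_IFNAME=${netdev}"
}
```

For the example above you would call it as pci_to_rdma 0000:3b:00.0. It only reports the first device and interface it finds, so treat the result as a hint to double-check rather than a final answer.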

You can confirm the mapping with:

ibdev2netdev

which outputs lines like:

mlx5_0 port 1 ==> ens2f0 (Up)

So, set:

  • NCCL_IB_HCA=mlx5_0
  • NCCL_SOCKET_IFNAME=ens2f0
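
If you would rather not copy the values by hand, the mapping can be scripted. The sketch below assumes ibdev2netdev prints lines in the format shown above (device, "port", port number, "==>", interface, state); the function name nccl_env_from_ibdev2netdev is hypothetical, and picking the first port that is Up is a simplification.

```shell
# Sketch: read ibdev2netdev-style lines on stdin and print export statements
# for the first port reported as (Up). Returns non-zero if no port is Up.
nccl_env_from_ibdev2netdev() {
    awk '
        # Expected fields: <hca> port <n> ==> <netdev> (<state>)
        $4 == "==>" && $6 == "(Up)" {
            printf "export NCCL_IB_HCA=%s\n", $1
            printf "export NCCL_SOCKET_IFNAME=%s\n", $5
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}
```

Inside a pod with RDMA access you could run eval "$(ibdev2netdev | nccl_env_from_ibdev2netdev)" before starting the workload. On nodes with several active NICs you may instead want to list more than one device, since NCCL_IB_HCA accepts a comma-separated list.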

For more, see the NVIDIA Network Operator documentation and the KubeRay integration docs. Would you like a full RayService CRD example with GPU and RDMA resources?

Sources: