How to Set Up InfiniBand on Azure#

Azure GPU-optimized virtual machines provide a low-latency, high-bandwidth InfiniBand network. This guide walks through the steps to enable InfiniBand and verify that it improves network performance.

Build a Virtual Machine#

Start by creating a GPU-optimized VM from the Azure portal. The example configuration below is the one used throughout this guide.

  • Create a new VM instance.

  • Select the East US region.

  • Change Availability options to Availability set and create a set.

    • If building multiple instances, put the additional instances in the same set.

  • Use the 2nd Gen Ubuntu 20.04 image.

    • Search all images for Ubuntu Server 20.04 and choose the second one down on the list.

  • Change size to ND40rs_v2.

  • Set password login with credentials.

    • User someuser

    • Password somepassword

  • Leave all other options as default.

Then connect to the VM using your preferred method.

Install Software#

Before installing the drivers, ensure the system is up to date.

sudo apt-get update
sudo apt-get upgrade -y

NVIDIA Drivers#

The commands below should work for Ubuntu. See the CUDA Toolkit documentation for details on installing on other operating systems.

sudo apt-get install -y linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-drivers

Restart the VM instance.

sudo reboot

Once the VM boots, reconnect and run nvidia-smi to verify driver installation.

nvidia-smi
Mon Nov 14 20:32:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000001:00:00.0 Off |                    0 |
| N/A   34C    P0    41W / 300W |    445MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000002:00:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000003:00:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000004:00:00.0 Off |                    0 |
| N/A   35C    P0    44W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000005:00:00.0 Off |                    0 |
| N/A   35C    P0    41W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000006:00:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000007:00:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000008:00:00.0 Off |                    0 |
| N/A   38C    P0    44W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                427MiB |
|    0   N/A  N/A      1762      G   /usr/bin/gnome-shell               16MiB |
|    1   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    7   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
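
To check the driver from a script rather than by reading the table, one option is to ask nvidia-smi for CSV output and count the rows. The sketch below is only an illustration: the helper names parse_gpu_list and list_gpus are hypothetical, and it assumes GPU names contain no commas. On the ND40rs_v2 instance shown above we would expect eight entries.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse the CSV rows printed by
    `nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader`.
    """
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Assumption: three comma-separated fields, no commas inside the name.
        index, name, driver = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "driver": driver})
    return gpus


def list_gpus():
    """Run nvidia-smi on the VM and return the parsed GPU list."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,driver_version",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_list(result.stdout)
```

Calling list_gpus() on the VM returns one dictionary per visible GPU; an empty list or a raised error suggests the driver did not load.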

InfiniBand Driver#

On Ubuntu 20.04, install the required packages:

sudo apt-get install -y automake dh-make git libcap2 libnuma-dev libtool make pkg-config udev curl librdmacm-dev rdma-core \
    libgfortran5 bison chrpath flex graphviz gfortran tk dpatch quilt swig tcl ibverbs-utils

Check the installation:

ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.28.4000
        node_guid:                      0015:5dff:fe33:ff2c
        sys_image_guid:                 0c42:a103:00b3:2f68
        vendor_id:                      0x02c9
        vendor_part_id:                 4120
        hw_ver:                         0x0
        board_id:                       MT_0000000010
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 7
                        port_lid:               115
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: rdmaP36305p0s2
        transport:                      InfiniBand (0)
        fw_ver:                         2.43.7008
        node_guid:                      6045:bdff:feed:8445
        sys_image_guid:                 043f:7203:0003:d583
        vendor_id:                      0x02c9
        vendor_part_id:                 4100
        hw_ver:                         0x0
        board_id:                       MT_1090111019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
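
In the output above, only mlx5_0 reports link_layer InfiniBand; the second device is an Ethernet-backed RDMA device. A small parser can make that check explicit. This is a minimal sketch under the assumption that the output follows the "key: value" layout shown above; the function names are hypothetical, and if a device has several ports, later ports overwrite earlier ones.

```python
def parse_ibv_devinfo(text):
    """Return {hca_id: {field: value}} parsed from `ibv_devinfo` output."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, since GUID values contain colons.
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            current = devices.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return devices


def active_infiniband_devices(text):
    """Names of devices with an active port and an InfiniBand link layer."""
    return [
        name
        for name, fields in parse_ibv_devinfo(text).items()
        if fields.get("link_layer") == "InfiniBand"
        and fields.get("state", "").startswith("PORT_ACTIVE")
    ]
```

Passing the text printed by ibv_devinfo to active_infiniband_devices gives the list of devices that UCX could use for InfiniBand transports.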

Enable IPoIB#

sudo sed -i -e 's/# OS.EnableRDMA=y/OS.EnableRDMA=y/g' /etc/waagent.conf
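
The sed command uncomments the OS.EnableRDMA=y line in the Azure Linux agent configuration. It silently does nothing if the file does not contain exactly "# OS.EnableRDMA=y", so it can be worth confirming the result before rebooting. The sketch below mirrors the substitution on a string and checks for an uncommented setting; both function names are hypothetical.

```python
import re


def enable_rdma(conf_text):
    """Mirror the sed command: turn '# OS.EnableRDMA=y' into 'OS.EnableRDMA=y'."""
    return conf_text.replace("# OS.EnableRDMA=y", "OS.EnableRDMA=y")


def rdma_enabled(conf_text):
    """True if an uncommented OS.EnableRDMA=y line is present."""
    pattern = r"^\s*OS\.EnableRDMA\s*=\s*y\s*$"
    return re.search(pattern, conf_text, flags=re.MULTILINE) is not None
```

For example, rdma_enabled(open("/etc/waagent.conf").read()) should return True after the sed command has run.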

Reboot and reconnect.

sudo reboot

Check IB#

ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
    inet 10.6.0.5/24 brd 10.6.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::6245:bdff:fea7:42cc/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 00:15:5d:33:ff:16 brd ff:ff:ff:ff:ff:ff
4: enP44906s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
    altname enP44906p0s2
5: ibP59423s2: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 00:00:09:27:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:16 brd 00:ff:ff:ff:ff:12:40:1b:80:1d:00:00:00:00:00:00:ff:ff:ff:ff
    altname ibP59423p0s2

Then inspect the GPU and InfiniBand device topology.

nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  CPU Affinity  NUMA Affinity
GPU0    X     NV2   NV1   NV2   NODE  NODE  NV1   NODE  NODE    0-19          0
GPU1    NV2   X     NV2   NV1   NODE  NODE  NODE  NV1   NODE    0-19          0
GPU2    NV1   NV2   X     NV1   NV2   NODE  NODE  NODE  NODE    0-19          0
GPU3    NV2   NV1   NV1   X     NODE  NV2   NODE  NODE  NODE    0-19          0
GPU4    NODE  NODE  NV2   NODE  X     NV1   NV1   NV2   NODE    0-19          0
GPU5    NODE  NODE  NODE  NV2   NV1   X     NV2   NV1   NODE    0-19          0
GPU6    NV1   NODE  NODE  NODE  NV1   NV2   X     NV2   NODE    0-19          0
GPU7    NODE  NV1   NODE  NODE  NV2   NV1   NV2   X     NODE    0-19          0
mlx5_0  NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
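
Using the legend, the GPU pairs that are directly connected by NVLink are the cells starting with NV. The following sketch extracts those pairs from the matrix text. It assumes the layout shown above, with the GPU columns listed first in the header row, and the function name is hypothetical.

```python
def nvlink_pairs(topo_text):
    """Return sorted GPU pairs whose matrix cell starts with 'NV'."""
    gpu_cols = None
    pairs = set()
    for line in topo_text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].startswith("GPU"):
            continue  # skip the command line, legend and NIC rows
        if gpu_cols is None:
            # The first line starting with GPU is the header row.
            gpu_cols = [tok for tok in tokens if tok.startswith("GPU")]
            continue
        row = tokens[0]
        cells = tokens[1:1 + len(gpu_cols)]
        for col, cell in zip(gpu_cols, cells):
            if cell.startswith("NV"):
                pairs.add(tuple(sorted((row, col))))
    return sorted(pairs)
```

Pairs that are not returned (NODE in the matrix above) communicate over PCIe within the NUMA node instead of NVLink.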

Install UCX-Py and tools#

wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh

Accept the defaults and allow conda init to run. Then start a new shell.

Create a conda environment (see the UCX-Py documentation):

mamba create -n ucxpy -c rapidsai -c conda-forge -c nvidia rapids=24.10 python=3.12 cuda-version=12.5 ipython "ucx-proc=*=gpu" ucx ucx-py dask distributed numpy cupy pytest pynvml -y
mamba activate ucxpy

Clone the UCX-Py repository locally:

git clone https://github.com/rapidsai/ucx-py.git
cd ucx-py

Run Tests#

Start by running the UCX-Py test suite, from within the ucx-py repo:

pytest -vs tests/
pytest -vs ucp/_libs/tests/

Now check whether InfiniBand works. To do so, run some of the benchmarks included in UCX-Py, for example:

# cd out of the ucx-py directory
cd ..
# Let UCX pick the best transport (expecting NVLink when available,
# otherwise InfiniBand, or TCP in worst case) on devices 0 and 1
python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB

# Force TCP-only on devices 0 and 1
UCX_TLS=tcp,cuda_copy python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB

We expect the first case above to have much higher bandwidth than the second. If you have both NVLink and InfiniBand connectivity, you can restrict UCX to a specific transport by setting UCX_TLS, e.g.:

# NVLink (if available) or TCP
UCX_TLS=tcp,cuda_copy,cuda_ipc

# InfiniBand (if available) or TCP
UCX_TLS=tcp,cuda_copy,rc
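
To compare transports in one script, the same benchmark can be launched with a different UCX_TLS value each time. The sketch below only builds the environment and command line; the preset labels and both helper names are hypothetical, while the UCX_TLS strings and benchmark flags are the ones used in this guide.

```python
import os

# Transport lists copied from the examples in this guide. The dictionary keys
# are illustrative labels, not UCX terminology.
UCX_TLS_PRESETS = {
    "tcp": "tcp,cuda_copy",
    "nvlink": "tcp,cuda_copy,cuda_ipc",
    "infiniband": "tcp,cuda_copy,rc",
}


def benchmark_env(preset=None, base_env=None):
    """Copy the environment and set UCX_TLS for the chosen preset.

    With preset=None, UCX_TLS is removed so UCX picks the transport itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    if preset is None:
        env.pop("UCX_TLS", None)
    else:
        env["UCX_TLS"] = UCX_TLS_PRESETS[preset]
    return env


def benchmark_command(server_dev=0, client_dev=1, size="128MiB"):
    """Build the send_recv benchmark command used above."""
    return [
        "python", "-m", "ucp.benchmarks.send_recv",
        "--server-dev", str(server_dev),
        "--client-dev", str(client_dev),
        "-o", "rmm", "--reuse-alloc", "-n", size,
    ]
```

A run restricted to InfiniBand or TCP would then look like subprocess.run(benchmark_command(), env=benchmark_env("infiniband"), check=True), executed inside the ucxpy environment.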

Run Benchmarks#

Finally, let’s run the merge benchmark from dask-cuda.

This benchmark uses Dask to perform a merge of two dataframes that are distributed across all the available GPUs on your VM. Merges are a challenging benchmark in a distributed setting since they require communication-intensive shuffle operations of the participating dataframes (see the Dask documentation for more on this type of operation). To perform the merge, each dataframe is shuffled such that rows with the same join key appear on the same GPU. This results in an all-to-all communication pattern which requires a lot of communication between the GPUs. As a result, network performance will be very important for the throughput of the benchmark.
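
To make the all-to-all pattern concrete, here is a toy, pure-Python sketch of a hash-based shuffle. It is not how Dask or cuDF implement the operation; it only illustrates why every worker may end up sending rows to every other worker. The function name and the (key, value) row format are assumptions made for the example.

```python
from collections import defaultdict


def shuffle_by_key(partitions, n_workers):
    """Redistribute (key, value) rows so equal keys land on the same worker.

    Returns the shuffled partitions and a dict mapping each
    (source, destination) worker pair to the rows sent between them.
    """
    outgoing = defaultdict(list)
    for src, rows in enumerate(partitions):
        for key, value in rows:
            dst = hash(key) % n_workers  # destination chosen by join key
            outgoing[(src, dst)].append((key, value))

    shuffled = [[] for _ in range(n_workers)]
    for (_, dst), rows in outgoing.items():
        shuffled[dst].extend(rows)
    return shuffled, outgoing
```

With n workers there can be up to n * n (source, destination) pairs, which is why the interconnect between GPUs dominates the runtime of this benchmark.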

Below we run on devices 0 through 7 (inclusive). Adjust --devs to match the number of devices available on your VM; the default is to run on GPU 0 only. Additionally, --chunk-size 100_000_000 is a safe value for 32GB GPUs. Adjust it in proportion to the memory of your GPUs; it scales linearly, so 50_000_000 should be good for 16GB and 150_000_000 for 48GB.
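
The linear scaling rule for the chunk size can be written as a one-line calculation. The helpers below are a sketch with hypothetical names; the 32GB reference point of 100_000_000 rows is the value given in this guide.

```python
def scaled_chunk_size(gpu_memory_gb, reference_rows=100_000_000,
                      reference_memory_gb=32):
    """Scale the chunk size linearly with GPU memory."""
    return int(reference_rows * gpu_memory_gb / reference_memory_gb)


def devs_argument(n_gpus):
    """Build the comma-separated --devs value for GPUs 0..n_gpus-1."""
    return ",".join(str(i) for i in range(n_gpus))


print(scaled_chunk_size(16))   # 50000000
print(scaled_chunk_size(48))   # 150000000
print(devs_argument(8))        # 0,1,2,3,4,5,6,7
```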

# Default Dask TCP communication protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --no-show-p2p-bandwidth
Merge benchmark
--------------------------------------------------------------------------------
Backend                   | dask
Merge type                | gpu
Rows-per-chunk            | 100000000
Base-chunks               | 8
Other-chunks              | 8
Broadcast                 | default
Protocol                  | tcp
Device(s)                 | 0,1,2,3,4,5,6,7
RMM Pool                  | True
Frac-match                | 0.3
Worker thread(s)          | 1
Data processed            | 23.84 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
48.51 s                   | 503.25 MiB/s
47.85 s                   | 510.23 MiB/s
41.20 s                   | 592.57 MiB/s
================================================================================
Throughput                | 532.43 MiB/s +/- 22.13 MiB/s
Bandwidth                 | 44.76 MiB/s +/- 0.93 MiB/s
Wall clock                | 45.85 s +/- 3.30 s

# UCX protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx --no-show-p2p-bandwidth
Merge benchmark
--------------------------------------------------------------------------------
Backend                   | dask
Merge type                | gpu
Rows-per-chunk            | 100000000
Base-chunks               | 8
Other-chunks              | 8
Broadcast                 | default
Protocol                  | ucx
Device(s)                 | 0,1,2,3,4,5,6,7
RMM Pool                  | True
Frac-match                | 0.3
TCP                       | None
InfiniBand                | None
NVLink                    | None
Worker thread(s)          | 1
Data processed            | 23.84 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
9.57 s                    | 2.49 GiB/s
6.01 s                    | 3.96 GiB/s
9.80 s                    | 2.43 GiB/s
================================================================================
Throughput                | 2.82 GiB/s +/- 341.13 MiB/s
Bandwidth                 | 159.89 MiB/s +/- 8.96 MiB/s
Wall clock                | 8.46 s +/- 1.73 s
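
As a rough comparison of the two example runs above, the summary lines can be converted to the same unit and divided. This is plain arithmetic on the numbers printed in this guide, not an additional measurement, and results on your VM will differ.

```python
MIB_PER_UNIT = {"MiB/s": 1, "GiB/s": 1024}


def to_mib_per_s(value, unit):
    """Convert a throughput value to MiB/s."""
    return value * MIB_PER_UNIT[unit]


tcp_throughput = to_mib_per_s(532.43, "MiB/s")  # TCP run, summary line
ucx_throughput = to_mib_per_s(2.82, "GiB/s")    # UCX run, summary line

throughput_ratio = ucx_throughput / tcp_throughput
wall_clock_ratio = 45.85 / 8.46  # TCP seconds / UCX seconds

print(f"throughput ratio: {throughput_ratio:.1f}x")
print(f"wall clock ratio: {wall_clock_ratio:.1f}x")
```

Both ratios come out at roughly 5.4, so in this example the UCX run moved data about five times faster than the TCP run.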