# How to Set Up InfiniBand on Azure

[Azure GPU optimized virtual machines](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provide a low-latency, high-bandwidth InfiniBand network. This guide walks through the steps to enable InfiniBand and optimize network performance.

## Build a Virtual Machine

Start by creating a GPU optimized VM from the Azure portal. Below is an example configuration that we will use for demonstration.

- Create a new VM instance.
- Select the `East US` region.
- Change `Availability options` to `Availability set` and create a set.
  - If building multiple instances, put the additional instances in the same set.
- Use the 2nd Gen Ubuntu 20.04 image.
  - Search all images for `Ubuntu Server 20.04` and choose the second one down on the list.
- Change the size to `ND40rs_v2`.
- Set password login with credentials.
  - User `someuser`
  - Password `somepassword`
- Leave all other options as default.
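If you prefer to script the deployment rather than click through the portal, a rough Azure CLI equivalent of the steps above is sketched below. This is a minimal sketch, not part of the original walkthrough: the resource group, availability set, VM name, and credentials are placeholders, and you should double-check the image URN and VM size availability in your subscription and region.

```shell
# Placeholder names and credentials -- adjust before running.
az group create --name my-rg --location eastus

az vm availability-set create \
    --resource-group my-rg \
    --name my-avail-set \
    --location eastus

# Gen2 Ubuntu 20.04 image URN and ND40rs_v2 size; the password must meet
# Azure's complexity requirements.
az vm create \
    --resource-group my-rg \
    --name ib-demo-vm \
    --location eastus \
    --availability-set my-avail-set \
    --size Standard_ND40rs_v2 \
    --image Canonical:0001-com-ubuntu-server-focal:20_04-lts-gen2:latest \
    --admin-username someuser \
    --admin-password somepassword
```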
Then connect to the VM using your preferred method.

## Install Software

Before installing the drivers ensure the system is up to date.

```shell
sudo apt-get update
sudo apt-get upgrade -y
```

### NVIDIA Drivers

The commands below should work for Ubuntu. See the [CUDA Toolkit documentation](https://docs.nvidia.com/cuda/index.html#installation-guides) for details on installing on other operating systems.

```shell
sudo apt-get install -y linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-drivers
```

Restart the VM instance

```shell
sudo reboot
```

Once the VM boots, reconnect and run `nvidia-smi` to verify driver installation.

```shell
nvidia-smi
```

```shell
Mon Nov 14 20:32:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000001:00:00.0 Off |                    0 |
| N/A   34C    P0    41W / 300W |    445MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000002:00:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000003:00:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000004:00:00.0 Off |                    0 |
| N/A   35C    P0    44W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000005:00:00.0 Off |                    0 |
| N/A   35C    P0    41W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000006:00:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000007:00:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000008:00:00.0 Off |                    0 |
| N/A   38C    P0    44W / 300W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                427MiB |
|    0   N/A  N/A      1762      G   /usr/bin/gnome-shell               16MiB |
|    1   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
|    7   N/A  N/A      1396      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
```

### InfiniBand Driver

On Ubuntu 20.04

```shell
sudo apt-get install -y automake dh-make git libcap2 libnuma-dev libtool make pkg-config udev curl librdmacm-dev rdma-core \
    libgfortran5 bison chrpath flex graphviz gfortran tk dpatch quilt swig tcl ibverbs-utils
```

Check the install

```shell
ibv_devinfo
```

```shell
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.28.4000
        node_guid:                      0015:5dff:fe33:ff2c
        sys_image_guid:                 0c42:a103:00b3:2f68
        vendor_id:                      0x02c9
        vendor_part_id:                 4120
        hw_ver:                         0x0
        board_id:                       MT_0000000010
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 7
                        port_lid:               115
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: rdmaP36305p0s2
        transport:                      InfiniBand (0)
        fw_ver:                         2.43.7008
        node_guid:                      6045:bdff:feed:8445
        sys_image_guid:                 043f:7203:0003:d583
        vendor_id:                      0x02c9
        vendor_part_id:                 4100
        hw_ver:                         0x0
        board_id:                       MT_1090111019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
```
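As an optional, quicker sanity check (not part of the original steps), `ibv_devices` from the same `ibverbs-utils` package prints only the adapter names and node GUIDs. The InfiniBand adapter should appear as `mlx5_0` alongside the Ethernet-backed `rdma*` device shown above.

```shell
# Optional: list only the RDMA device names and node GUIDs
ibv_devices
```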
#### Enable IPoIB

```shell
sudo sed -i -e 's/# OS.EnableRDMA=y/OS.EnableRDMA=y/g' /etc/waagent.conf
```

Reboot and reconnect.

```shell
sudo reboot
```

#### Check IB

```shell
ip addr show
```

```shell
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
    inet 10.6.0.5/24 brd 10.6.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::6245:bdff:fea7:42cc/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 00:15:5d:33:ff:16 brd ff:ff:ff:ff:ff:ff
4: enP44906s1: mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
    altname enP44906p0s2
5: ibP59423s2: mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 00:00:09:27:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:16 brd 00:ff:ff:ff:ff:12:40:1b:80:1d:00:00:00:00:00:00:ff:ff:ff:ff
    altname ibP59423p0s2
```

```shell
nvidia-smi topo -m
```

```shell
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV2     NV1     NV2     NODE    NODE    NV1     NODE    NODE    0-19            0
GPU1    NV2      X      NV2     NV1     NODE    NODE    NODE    NV1     NODE    0-19            0
GPU2    NV1     NV2      X      NV1     NV2     NODE    NODE    NODE    NODE    0-19            0
GPU3    NV2     NV1     NV1      X      NODE    NV2     NODE    NODE    NODE    0-19            0
GPU4    NODE    NODE    NV2     NODE     X      NV1     NV1     NV2     NODE    0-19            0
GPU5    NODE    NODE    NODE    NV2     NV1      X      NV2     NV1     NODE    0-19            0
GPU6    NV1     NODE    NODE    NODE    NV1     NV2      X      NV2     NODE    0-19            0
GPU7    NODE    NV1     NODE    NODE    NV2     NV1     NV2      X      NODE    0-19            0
mlx5_0  NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

### Install UCX-Py and tools

```shell
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh
```

Accept the defaults and allow conda init to run. Then start a new shell.

Create a conda environment (see the [UCX-Py](https://ucx-py.readthedocs.io/en/latest/install.html) docs)

```shell
mamba create -n ucxpy {{ rapids_conda_channels }} {{ rapids_conda_packages }} ipython ucx-proc=*=gpu ucx ucx-py dask distributed numpy cupy pytest pynvml -y
mamba activate ucxpy
```

Clone the UCX-Py repo locally

```shell
git clone https://github.com/rapidsai/ucx-py.git
cd ucx-py
```

### Run Tests

Start by running the UCX-Py test suite, from within the `ucx-py` repo:

```shell
pytest -vs tests/
pytest -vs ucp/_libs/tests/
```
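It can also be worth confirming that the UCX build inside the conda environment detects the InfiniBand device. This extra check is not part of the original steps; `ucx_info` ships with the `ucx` package, and the exact transport and device names listed will vary with the UCX version and hardware.

```shell
# Print the UCX version and build configuration
ucx_info -v

# List the transports and devices UCX can use; with IPoIB enabled you should
# see InfiniBand entries (e.g. an rc transport on mlx5_0:1) alongside tcp and cuda
ucx_info -d | grep -i -E "transport|device"
```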
Now check whether InfiniBand works. For that, you can run some of the benchmarks included in UCX-Py, for example:

```shell
# cd out of the ucx-py directory
cd ..

# Let UCX pick the best transport (expecting NVLink when available,
# otherwise InfiniBand, or TCP in worst case) on devices 0 and 1
python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB

# Force TCP-only on devices 0 and 1
UCX_TLS=tcp,cuda_copy python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB
```

We expect the first case above to have much higher bandwidth than the second. If you happen to have both NVLink and InfiniBand connectivity, you may limit UCX to a specific transport by setting `UCX_TLS`, e.g.:

```shell
# NVLink (if available) or TCP
UCX_TLS=tcp,cuda_copy,cuda_ipc

# InfiniBand (if available) or TCP
UCX_TLS=tcp,cuda_copy,rc
```
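For example, to measure InfiniBand in isolation you can prefix the same send/recv benchmark with the InfiniBand transport list from above. The bandwidth you observe will depend on the VM size and the devices chosen.

```shell
# Restrict UCX to InfiniBand (rc) with TCP fallback and rerun the benchmark
UCX_TLS=tcp,cuda_copy,rc python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB
```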
## Run Benchmarks

Finally, let's run the [merge benchmark](https://github.com/rapidsai/dask-cuda/blob/HEAD/dask_cuda/benchmarks/local_cudf_merge.py) from `dask-cuda`. This benchmark uses Dask to perform a merge of two dataframes that are distributed across all the available GPUs on your VM. Merges are a challenging benchmark in a distributed setting since they require communication-intensive shuffle operations of the participating dataframes (see the [Dask documentation](https://docs.dask.org/en/stable/dataframe-best-practices.html#avoid-full-data-shuffling) for more on this type of operation). To perform the merge, each dataframe is shuffled such that rows with the same join key appear on the same GPU. This results in an all-to-all communication pattern which requires a lot of communication between the GPUs. As a result, network performance is very important for the throughput of the benchmark.

Below we are running on devices 0 through 7 (inclusive); you will want to adjust that for the number of devices available on your VM (the default is to run on GPU 0 only). Additionally, `--chunk-size 100_000_000` is a safe value for 32GB GPUs; you may adjust it in proportion to the memory of your GPUs (it scales linearly, so `50_000_000` should be good for 16GB and `150_000_000` for 48GB).

```shell
# Default Dask TCP communication protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --no-show-p2p-bandwidth
```

```shell
Merge benchmark
--------------------------------------------------------------------------------
Backend                    | dask
Merge type                 | gpu
Rows-per-chunk             | 100000000
Base-chunks                | 8
Other-chunks               | 8
Broadcast                  | default
Protocol                   | tcp
Device(s)                  | 0,1,2,3,4,5,6,7
RMM Pool                   | True
Frac-match                 | 0.3
Worker thread(s)           | 1
Data processed             | 23.84 GiB
Number of workers          | 8
================================================================================
Wall clock                 | Throughput
--------------------------------------------------------------------------------
48.51 s                    | 503.25 MiB/s
47.85 s                    | 510.23 MiB/s
41.20 s                    | 592.57 MiB/s
================================================================================
Throughput                 | 532.43 MiB/s +/- 22.13 MiB/s
Bandwidth                  | 44.76 MiB/s +/- 0.93 MiB/s
Wall clock                 | 45.85 s +/- 3.30 s
```

```shell
# UCX protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx --no-show-p2p-bandwidth
```

```shell
Merge benchmark
--------------------------------------------------------------------------------
Backend                    | dask
Merge type                 | gpu
Rows-per-chunk             | 100000000
Base-chunks                | 8
Other-chunks               | 8
Broadcast                  | default
Protocol                   | ucx
Device(s)                  | 0,1,2,3,4,5,6,7
RMM Pool                   | True
Frac-match                 | 0.3
TCP                        | None
InfiniBand                 | None
NVLink                     | None
Worker thread(s)           | 1
Data processed             | 23.84 GiB
Number of workers          | 8
================================================================================
Wall clock                 | Throughput
--------------------------------------------------------------------------------
9.57 s                     | 2.49 GiB/s
6.01 s                     | 3.96 GiB/s
9.80 s                     | 2.43 GiB/s
================================================================================
Throughput                 | 2.82 GiB/s +/- 341.13 MiB/s
Bandwidth                  | 159.89 MiB/s +/- 8.96 MiB/s
Wall clock                 | 8.46 s +/- 1.73 s
```

```{relatedexamples}

```