How to Set Up InfiniBand on Azure#
Azure GPU-optimized virtual machines provide a low-latency, high-bandwidth InfiniBand network. This guide walks through the steps to enable InfiniBand and optimize network performance.
Build a Virtual Machine#
Start by creating a GPU-optimized VM from the Azure portal. Below is an example configuration that we will use for demonstration (an equivalent Azure CLI command is sketched after the list):
- Create a new VM instance.
- Select the East US region.
- Change Availability options to Availability set and create a set. If building multiple instances, put the additional instances in the same set.
- Use the 2nd Gen Ubuntu 20.04 image: search all images for Ubuntu Server 20.04 and choose the second one down the list.
- Change the size to ND40rs_v2.
- Set password login with credentials:
  - User: someuser
  - Password: somepassword
- Leave all other options as default.
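If you prefer scripting the VM creation, a rough Azure CLI equivalent of the portal steps above is sketched here; the resource group, availability set, VM name, and image URN are placeholders/assumptions to adapt to your subscription.
# Assumed resource names and image URN; replace with your own values
az vm create \
    --resource-group my-rg \
    --name my-gpu-vm \
    --location eastus \
    --availability-set my-ib-set \
    --size Standard_ND40rs_v2 \
    --image Canonical:0001-com-ubuntu-server-focal:20_04-lts-gen2:latest \
    --admin-username someuser \
    --admin-password somepassword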
Then connect to the VM using your preferred method.
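For example, with the password login configured above you can SSH to the VM's public IP (the address below is a placeholder):
ssh someuser@<vm-public-ip>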
Install Software#
Before installing the drivers, ensure the system is up to date:
sudo apt-get update
sudo apt-get upgrade -y
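If the upgrade installed a new kernel, it is a good idea to reboot before continuing so that the running kernel matches the headers installed in the next step:
sudo reboot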
NVIDIA Drivers#
The commands below should work for Ubuntu. See the CUDA Toolkit documentation for details on installing on other operating systems.
sudo apt-get install -y linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-drivers
Restart the VM instance:
sudo reboot
Once the VM boots, reconnect and run nvidia-smi to verify driver installation.
nvidia-smi
Mon Nov 14 20:32:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000001:00:00.0 Off | 0 |
| N/A 34C P0 41W / 300W | 445MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000002:00:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000003:00:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000004:00:00.0 Off | 0 |
| N/A 35C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000005:00:00.0 Off | 0 |
| N/A 35C P0 41W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000006:00:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000007:00:00.0 Off | 0 |
| N/A 37C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000008:00:00.0 Off | 0 |
| N/A 38C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1396 G /usr/lib/xorg/Xorg 427MiB |
| 0 N/A N/A 1762 G /usr/bin/gnome-shell 16MiB |
| 1 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
InfiniBand Driver#
On Ubuntu 20.04, install the required packages:
sudo apt-get install -y automake dh-make git libcap2 libnuma-dev libtool make pkg-config udev curl librdmacm-dev rdma-core \
libgfortran5 bison chrpath flex graphviz gfortran tk dpatch quilt swig tcl ibverbs-utils
Check the installation:
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.28.4000
node_guid: 0015:5dff:fe33:ff2c
sys_image_guid: 0c42:a103:00b3:2f68
vendor_id: 0x02c9
vendor_part_id: 4120
hw_ver: 0x0
board_id: MT_0000000010
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 7
port_lid: 115
port_lmc: 0x00
link_layer: InfiniBand
hca_id: rdmaP36305p0s2
transport: InfiniBand (0)
fw_ver: 2.43.7008
node_guid: 6045:bdff:feed:8445
sys_image_guid: 043f:7203:0003:d583
vendor_id: 0x02c9
vendor_part_id: 4100
hw_ver: 0x0
board_id: MT_1090111019
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
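In the output above, mlx5_0 reports link_layer InfiniBand and state PORT_ACTIVE, so it is the InfiniBand HCA; the second device reports link_layer Ethernet and belongs to the VM's Ethernet NIC. To check just these fields, you can filter the output, for example:
ibv_devinfo | grep -E 'hca_id|state|link_layer'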
Enable IPoIB#
sudo sed -i -e 's/# OS.EnableRDMA=y/OS.EnableRDMA=y/g' /etc/waagent.conf
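You can confirm the setting took effect before rebooting:
grep EnableRDMA /etc/waagent.conf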
Reboot and reconnect.
sudo reboot
Check IB#
ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
inet 10.6.0.5/24 brd 10.6.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::6245:bdff:fea7:42cc/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:15:5d:33:ff:16 brd ff:ff:ff:ff:ff:ff
4: enP44906s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
altname enP44906p0s2
5: ibP59423s2: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
link/infiniband 00:00:09:27:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:16 brd 00:ff:ff:ff:ff:12:40:1b:80:1d:00:00:00:00:00:00:ff:ff:ff:ff
altname ibP59423p0s2
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X NV2 NV1 NV2 NODE NODE NV1 NODE NODE 0-19 0
GPU1 NV2 X NV2 NV1 NODE NODE NODE NV1 NODE 0-19 0
GPU2 NV1 NV2 X NV1 NV2 NODE NODE NODE NODE 0-19 0
GPU3 NV2 NV1 NV1 X NODE NV2 NODE NODE NODE 0-19 0
GPU4 NODE NODE NV2 NODE X NV1 NV1 NV2 NODE 0-19 0
GPU5 NODE NODE NODE NV2 NV1 X NV2 NV1 NODE 0-19 0
GPU6 NV1 NODE NODE NODE NV1 NV2 X NV2 NODE 0-19 0
GPU7 NODE NV1 NODE NODE NV2 NV1 NV2 X NODE 0-19 0
mlx5_0 NODE NODE NODE NODE NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Install UCX-Py and tools#
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh
Accept the defaults and allow conda init to run, then start a new shell.
Create a conda environment (see the UCX-Py docs) and activate it:
mamba create -n ucxpy -c rapidsai -c conda-forge -c nvidia rapids=24.10 python=3.12 cuda-version=12.5 ipython ucx-proc=*=gpu ucx ucx-py dask distributed numpy cupy pytest pynvml -y
mamba activate ucxpy
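To sanity-check the environment, you can print the UCX version that UCX-Py sees (this assumes ucp.get_ucx_version() is available in your UCX-Py version):
python -c "import ucp; print(ucp.get_ucx_version())"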
Clone the UCX-Py repo locally:
git clone https://github.com/rapidsai/ucx-py.git
cd ucx-py
Run Tests#
Start by running the UCX-Py test suite from within the ucx-py repo:
pytest -vs tests/
pytest -vs ucp/_libs/tests/
Now check whether InfiniBand works. To do so, you can run some of the benchmarks included in UCX-Py, for example:
# cd out of the ucx-py directory
cd ..
# Let UCX pick the best transport (expecting NVLink when available,
# otherwise InfiniBand, or TCP in worst case) on devices 0 and 1
python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB
# Force TCP-only on devices 0 and 1
UCX_TLS=tcp,cuda_copy python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB
We expect the first case above to achieve much higher bandwidth than the second. If you have both NVLink and InfiniBand connectivity, you can restrict UCX to a specific transport by setting UCX_TLS, e.g.:
# NVLink (if available) or TCP
UCX_TLS=tcp,cuda_copy,cuda_ipc
# InfiniBand (if available) or TCP
UCX_TLS=tcp,cuda_copy,rc
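For instance, to force InfiniBand (with TCP as fallback) on the same send/recv benchmark:
UCX_TLS=tcp,cuda_copy,rc python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB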
Run Benchmarks#
Finally, let’s run the merge benchmark from dask-cuda.
This benchmark uses Dask to merge two dataframes that are distributed across all the available GPUs on your VM. Merges are a challenging benchmark in a distributed setting since they require communication-intensive shuffles of the participating dataframes (see the Dask documentation for more on this type of operation). To perform the merge, each dataframe is shuffled so that rows with the same join key land on the same GPU. This results in an all-to-all communication pattern between the GPUs, so network performance largely determines the throughput of the benchmark.
Below we run on devices 0 through 7 (inclusive); you will want to adjust that for the number of devices available on your VM (the default is to run on GPU 0 only). Additionally, --chunk-size 100_000_000 is a safe value for 32 GB GPUs; you may adjust it in proportion to your GPU's memory (it scales linearly, so 50_000_000 should be good for 16 GB and 150_000_000 for 48 GB).
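For instance, on a hypothetical VM with four 16 GB GPUs, the corresponding invocation would be:
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3 --chunk-size 50_000_000 --no-show-p2p-bandwidth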
# Default Dask TCP communication protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --no-show-p2p-bandwidth
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 100000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | tcp
Device(s) | 0,1,2,3,4,5,6,7
RMM Pool | True
Frac-match | 0.3
Worker thread(s) | 1
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
48.51 s | 503.25 MiB/s
47.85 s | 510.23 MiB/s
41.20 s | 592.57 MiB/s
================================================================================
Throughput | 532.43 MiB/s +/- 22.13 MiB/s
Bandwidth | 44.76 MiB/s +/- 0.93 MiB/s
Wall clock | 45.85 s +/- 3.30 s
# UCX protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx --no-show-p2p-bandwidth
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 100000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | ucx
Device(s) | 0,1,2,3,4,5,6,7
RMM Pool | True
Frac-match | 0.3
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
9.57 s | 2.49 GiB/s
6.01 s | 3.96 GiB/s
9.80 s | 2.43 GiB/s
================================================================================
Throughput | 2.82 GiB/s +/- 341.13 MiB/s
Bandwidth | 159.89 MiB/s +/- 8.96 MiB/s
Wall clock | 8.46 s +/- 1.73 s