Horovod is a distributed deep learning framework that supports multiple backends, including TensorFlow/Keras, PyTorch, and Apache MXNet. The primary motivation for Horovod was to make it easy to take a program that trains on a single GPU and scale it out to leverage multiple GPUs, while requiring only minimal modification relative to the non-distributed version of the program. Horovod is built on the MPI model for distributing computation, and was initially developed when packages like TensorFlow, Keras, and PyTorch did not provide as much functionality for distributed training as they currently do. Therefore, if you have an existing TensorFlow/Keras or PyTorch program that you would like to scale out, you might first want to consider whether the distributed training support provided by those packages is sufficient for your needs, or whether you need additional functionality or performance that Horovod provides.

We will consider here an example from the Horovod documentation intended for use with TensorFlow/Keras v2, representing a multi-GPU implementation of the MNIST training problem. Consult this file in the Horovod GitHub repository for access to the code. Other code examples, including code for use with other deep learning backends such as PyTorch, can also be found in the Horovod documentation.

We will not reproduce all of the source code from the example here, but we note that Horovod's design philosophy requires only a small amount of initial configuration in your code, after which you can largely proceed with the single-GPU code you might have previously developed. In this example, each process is pinned to a single GPU, selected according to the process's local rank:
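The initialization near the top of the example looks approximately like the sketch below (lightly condensed from the Horovod TensorFlow2/Keras example): Horovod is initialized, and each process restricts itself to the one GPU that matches its local rank.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    # Initialize Horovod across all participating processes
    hvd.init()

    # Pin each process to a single GPU, selected by the process's local rank
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')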

Building a virtual environment

We're going to build a new virtual environment that will include both TensorFlow/Keras and Horovod, similar to some degree to one of the environments we built previously. It would appear that with the python3/3.9.2 module loaded on Frontera, one cannot successfully install and run tensorflow beyond version 2.8.2, so we will install that version into our environment. We will also install horovod version 0.25.0, based on TACC recommendations. Note that our installation of horovod comes with the specification of an additional environment variable, HOROVOD_WITH_TENSORFLOW=1. Because Horovod can work with multiple deep learning backends, setting this variable provides a mechanism for ensuring its ability to work with TensorFlow, as in the example we are considering. If you're interested in building the environment for use on Frontera, execute the following commands:
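One plausible sequence of commands is sketched below. The name and location of the virtual environment (~/horovod-tf here) are arbitrary choices, and the installation steps are representative rather than exact; you may need to adjust to what is currently available on Frontera.

    module load python3/3.9.2
    python3 -m venv ~/horovod-tf             # environment name/location is arbitrary
    source ~/horovod-tf/bin/activate
    pip install --upgrade pip
    pip install tensorflow==2.8.2
    HOROVOD_WITH_TENSORFLOW=1 pip install horovod==0.25.0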

Running the example code

You should have already downloaded the sample code linked above, which is in a file named tensorflow2_keras_mnist.py. If you have not, please do so now.

The mechanics of running this code on Frontera are similar to what we used in TensorFlow on Frontera; we will just use the interactive idev-based method here rather than the Slurm sbatch method, although you can certainly consult the previous instructions if you prefer the latter.

First, request a single node (with 4 attached GPUs) on the rtx-dev queue:
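A plausible idev invocation is shown below; the exact options you use may differ (for instance, the requested time, via -m in minutes here), but the queue is rtx-dev and a single node is requested.

    idev -p rtx-dev -N 1 -n 4 -m 60    # 1 node, 4 tasks, 60 minutes on the rtx-dev queue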

To follow along more fully, you might want to open a second terminal on Frontera and ssh into the rtx-dev compute node that you were just assigned through idev (e.g., ssh c196-022). This will enable you to monitor the GPU usage via the nvidia-smi tool. Once you've ssh'd into that compute node, you might as well start up the monitoring tool in a continuous loop, refreshing the output every 5 seconds, as shown below.
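For example (substituting the node name you were actually assigned):

    ssh c196-022        # replace with your assigned compute node
    nvidia-smi -l 5     # redisplay GPU utilization every 5 seconds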

Next, load the CUDA-relevant and phdf5 modules for use with this example code:
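The module names below are representative, loading the default versions; check module spider on Frontera for the versions currently available.

    module load cuda cudnn nccl    # CUDA toolkit and related GPU libraries
    module load phdf5              # parallel HDF5, used for the checkpoint files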

Although Horovod can run on multiple GPUs, the process for invoking it is a bit different from what we have seen in the other distributed training examples. In particular, Horovod is built upon the MPI library for distributed computation, and is run here as a 4-way parallel job, with each process offloading computations to a single GPU. On TACC systems, we can use the TACC utility ibrun to run the job in parallel using 4 processes:
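Assuming the idev session above was requested with 4 tasks, a command along the following lines launches one process per task:

    ibrun -n 4 python3 tensorflow2_keras_mnist.py    # plain "ibrun" would also use all allocated tasks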

While that program is running, switch to your other terminal and examine the GPU usage. You should see that all 4 GPUs are being utilized, each at roughly 30% capacity. You should observe printed output similar to what is included at the bottom of this page. In this second terminal, you can also run the top command, and you should observe that 4 python3 processes are running, each one pinned to a specific GPU.

What would happen if we didn't run this as a 4-way parallel job? We can test that out by removing the use of ibrun:
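That is, run the script directly as a single process:

    python3 tensorflow2_keras_mnist.py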

While this version of the program is running, switch to your other terminal and examine the GPU usage. Also run top to check the python processes running. You should see that only 1 of the GPUs is being utilized, at approximately 50% capacity, with only 1 associated python process running. The 4-GPU version ran faster than the 1-GPU version because it was able to make use of more GPU capacity. But it should be stressed that this is a small problem, for which data are not being fed quickly enough to all the GPU resources to exhibit as much acceleration as we might expect. For multi-GPU nodes such as those on Frontera, you will want to make sure that you are running problems that can effectively make use of that enhanced capacity and feed sufficient data through the computational pipeline. As we saw in a previous example, tensorflow issues warnings about the callback method being slow, an indication that the data are being subdivided into such small batches that the overhead of those operations inhibits performance.

After you have run this code example, you should notice in your current directory a set of files, named checkpoint-1.h5 through checkpoint-24.h5, one for each training epoch. These are HDF5 (.h5) files written by the program to checkpoint the current state of the computation. While this is just a short program to demonstrate basic operations, for long-running programs it is often advisable to carry out such checkpointing at regular intervals, so that the program can be restarted from an intermediate state if it does not run to completion. See our companion material for general information on checkpointing and restarting, as well as the TensorFlow and PyTorch documentation for further information about how to restart from these files.
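As a minimal sketch (not part of the example itself), a model saved in one of these .h5 files can be reloaded with Keras. Passing compile=False skips restoring the optimizer state, which sidesteps deserialization issues with custom optimizer wrappers such as Horovod's DistributedOptimizer; you would recompile the model before resuming training.

    import tensorflow as tf

    # Reload the model architecture and weights from the last checkpoint file
    model = tf.keras.models.load_model('checkpoint-24.h5', compile=False)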

If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.

Output from tensorflow2_keras_mnist.py
 