Horovod is a distributed deep learning framework that supports multiple backends, including TensorFlow/Keras, PyTorch, and Apache MXNet. The primary motivation for Horovod was to make it easy to take a program that trains on a single GPU and scale it out to leverage multiple GPUs, while requiring only minimal modification relative to the non-distributed version of the program. Horovod is built on the MPI model for distributing computation, and was initially developed when packages like TensorFlow, Keras, and PyTorch did not provide as much functionality for distributed training as they currently do. Therefore, if you have an existing TensorFlow/Keras or PyTorch program that you would like to scale out, you might first want to consider whether the distributed training support provided by those packages is sufficient for your needs, or whether you need additional functionality or performance that Horovod provides.

We will consider here an example from the Horovod documentation intended for use with TensorFlow/Keras v2, representing a multi-GPU implementation of the MNIST training problem. Consult this file in the Horovod GitHub repository for access to the code. Other code examples, including code for use with other deep learning backends such as PyTorch, can also be found in the Horovod documentation.

We will not reproduce all of the source code from the example here, but we note that Horovod's design philosophy requires only a small amount of initial configuration in your code, after which you can largely proceed with the single-GPU code you might have previously developed. In this example, each process is pinned to a single GPU, selected according to the process's local rank:
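The initialization near the top of the example looks approximately like the sketch below (lightly condensed from the Horovod TensorFlow2/Keras example): Horovod is initialized, and each process restricts itself to the one GPU that matches its local rank.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    # Initialize Horovod across all participating processes
    hvd.init()

    # Pin each process to a single GPU, selected by the process's local rank
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')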

Building a virtual environment

We're going to build a new virtual environment that will include both TensorFlow/Keras and Horovod, similar to some degree to one of the environments we built previously. It would appear that with the python3/3.9.2 module loaded on Frontera, one cannot successfully install and run tensorflow beyond version 2.8.2, so we will install that version into our environment. We will also install horovod version 0.25.0, based on TACC recommendations. Note that our installation of horovod comes with the specification of an additional environment variable, HOROVOD_WITH_TENSORFLOW=1. Because Horovod can work with multiple deep learning backends, setting this variable provides a mechanism for ensuring its ability to work with TensorFlow, as in the example we are considering. If you're interested in building the environment for use on Frontera, execute the following commands:
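One plausible sequence of commands is sketched below. The name and location of the virtual environment (~/horovod-tf here) are arbitrary choices, and the installation steps are representative rather than exact; you may need to adjust to what is currently available on Frontera.

    module load python3/3.9.2
    python3 -m venv ~/horovod-tf             # environment name/location is arbitrary
    source ~/horovod-tf/bin/activate
    pip install --upgrade pip
    pip install tensorflow==2.8.2
    HOROVOD_WITH_TENSORFLOW=1 pip install horovod==0.25.0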

Running the example code

You should have already downloaded the sample code linked above, which is in a file named tensorflow2_keras_mnist.py. If you have not, please do so now.

The mechanics of running this code on Frontera are similar to what we used in TensorFlow on Frontera; we will just use the interactive idev-based method here rather than the Slurm sbatch method, although you can certainly consult the previous instructions if you prefer the latter.

First, request a single node (with 4 attached GPUs) on the rtx-dev queue:
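A plausible idev invocation is shown below; the exact options you use may differ (for instance, the requested time, via -m in minutes here), but the queue is rtx-dev and a single node is requested.

    idev -p rtx-dev -N 1 -n 4 -m 60    # 1 node, 4 tasks, 60 minutes on the rtx-dev queue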

To follow along more fully, you might want to open a second terminal on Frontera and ssh into the rtx-dev compute node that you were just assigned through idev (e.g., ssh c196-022). This will enable you to monitor the GPU usage via the nvidia-smi tool. Once you've ssh'd into that compute node, you might as well start up the monitoring tool in a continuous loop, refreshing the output every 5 seconds, as shown below.
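For example (substituting the node name you were actually assigned):

    ssh c196-022        # replace with your assigned compute node
    nvidia-smi -l 5     # redisplay GPU utilization every 5 seconds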

Next, load the CUDA-relevant and phdf5 modules for use with this example code:
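The module names below are representative, loading the default versions; check module spider on Frontera for the versions currently available.

    module load cuda cudnn nccl    # CUDA toolkit and related GPU libraries
    module load phdf5              # parallel HDF5, used for the checkpoint files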

Although Horovod can run on multiple GPUs, the process for invoking it is a bit different from what we have seen in the other distributed training examples. In particular, Horovod is built upon the MPI library for distributed computation, and is run here as a 4-way parallel job, with each process offloading computations to a single GPU. On TACC systems, we can use the TACC utility ibrun to run the job in parallel using 4 processes:
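Assuming the idev session above was requested with 4 tasks, a command along the following lines launches one process per task:

    ibrun -n 4 python3 tensorflow2_keras_mnist.py    # plain "ibrun" would also use all allocated tasks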

While that program is running, switch to your other terminal and examine the GPU usage. You should see that all 4 GPUs are being utilized, each at roughly 30% capacity. You should observe printed output similar to what is included at the bottom of this page. In this second terminal, you can also run the top command, and you should observe that 4 python3 processes are running, each one pinned to a specific GPU.

What would happen if we didn't run this as a 4-way parallel job? We can test that out by removing the use of ibrun:
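That is, run the script directly as a single process:

    python3 tensorflow2_keras_mnist.py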

While this version of the program is running, switch to your other terminal and examine the GPU usage. Also run top to check the python processes running. You should see that only 1 of the GPUs is being utilized, at approximately 50% capacity, with only 1 associated python process running. The 4-GPU version ran faster than the 1-GPU version because it was able to make use of more GPU capacity. But it should be stressed that this is a small problem, for which data are not being fed quickly enough to all the GPU resources to exhibit as much acceleration as we might expect. For multi-GPU nodes such as those on Frontera, you will want to make sure that you are running problems that can effectively make use of that enhanced capacity and feed sufficient data through the computational pipeline. As we saw in a previous example, tensorflow issues warnings about the callback method being slow, an indication that the data are being subdivided into such small batches that the overhead of those operations inhibits performance.

After you have run this code example, you should notice in your current directory a set of files, named checkpoint-1.h5 through checkpoint-24.h5, one for each training epoch. These are HDF5 (.h5) files written by the program to checkpoint the current state of the computation. While this is just a short program to demonstrate basic operations, for long-running programs it is often advisable to carry out such checkpointing at regular intervals, so that the program can be restarted from an intermediate state if it does not run to completion. See our companion material for general information on checkpointing and restarting, as well as the TensorFlow and PyTorch documentation for further information about how to restart from these files.
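As a minimal sketch (not part of the example itself), a model saved in one of these .h5 files can be reloaded with Keras. Passing compile=False skips restoring the optimizer state, which sidesteps deserialization issues with custom optimizer wrappers such as Horovod's DistributedOptimizer; you would recompile the model before resuming training.

    import tensorflow as tf

    # Reload the model architecture and weights from the last checkpoint file
    model = tf.keras.models.load_model('checkpoint-24.h5', compile=False)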

If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.

Output from tensorflow2_keras_mnist.py
 