Parallelism in Training
Distributing a large computation over multiple resources to speed up calculations is a long-standing focus of high performance computing (HPC). Within the realm of deep learning, accelerating model training in particular can be important for very large models and/or datasets. For systems with multiple CPUs as well as multiple GPUs, training can in principle be distributed over both types of resources. For CPU nodes with multiple GPUs attached, as is the case on the Frontera system at TACC, it might be sufficient to focus on distributing model training over multiple GPUs on a single CPU node, since the GPUs carry out the bulk of the heavy computation in software environments such as TensorFlow and PyTorch.
Distributed training occurs primarily in two modes: data parallel and model parallel. For training focused on classifying very large images, a more specialized form of data parallelism, in which individual images are themselves decomposed across processes, has also been developed.
In data parallel training, a neural network model is replicated across multiple CPU/GPU nodes, with each node responsible for training on only part of the data (i.e., the data are distributed across nodes). Since each replica of the model sees only a portion of the data, the parameter values learned through training are likely to differ. Periodic communication between the processes running these replicas is therefore used to ensure consistency of the parameters learned from each subset of the data. For example, the parameter values (or gradients) learned over each subset might be averaged and then communicated back to each process.
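To make this concrete, the following is a minimal sketch of data parallel training using PyTorch's torch.distributed submodule. The model, data loader, and hyperparameters are placeholders, and the script assumes it is launched with torchrun (one process per GPU) so that every process can join the same process group; averaging gradients with an all-reduce is one common way to realize the averaging described above.

```python
# Minimal data parallel sketch: each process holds a full replica of the model,
# trains on its own shard of the data, and averages gradients with the others.
# Assumes launch via torchrun so RANK/WORLD_SIZE/MASTER_ADDR are set.
import torch
import torch.distributed as dist
import torch.nn as nn

def train_data_parallel(model: nn.Module, loader, epochs: int = 1):
    dist.init_process_group(backend="nccl")        # one process per GPU
    device = torch.device(f"cuda:{dist.get_rank() % torch.cuda.device_count()}")
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for inputs, targets in loader:             # each rank sees its own subset
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            # Average gradients across all replicas so that every copy of the
            # model applies the same update and the parameters stay consistent.
            for p in model.parameters():
                if p.grad is not None:
                    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                    p.grad /= dist.get_world_size()
            optimizer.step()

    dist.destroy_process_group()
```

In practice, PyTorch's DistributedDataParallel wrapper and TensorFlow's distribution strategies perform this gradient communication automatically and more efficiently; the explicit loop above simply exposes what those wrappers do behind the scenes.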
In model parallel training, the neural network model itself is distributed across multiple CPU/GPU nodes, with each node responsible for holding only part of the network. This is typically done only if the model (i.e., the network and associated parameters) is so large that it does not fit in memory on a single node. The full model could be partitioned in different ways. For example, different processes might hold different layers of the neural network, or they might each hold portions of all the layers. Communication among the different sub-networks is necessary to enable training of the entire model. Because model parallel training requires decomposing a complex structure like a neural network, it is generally more complicated to program than data parallel training, which typically involves just dividing a dataset into pieces.
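As an illustration, here is a minimal sketch of a naive form of model parallelism in PyTorch, in which the layers of a small network are split across two GPUs (availability of devices cuda:0 and cuda:1 is assumed) and intermediate activations are moved between devices during the forward pass. The network and layer sizes are purely illustrative.

```python
# Naive model parallel sketch: the first half of the network lives on GPU 0,
# the second half on GPU 1, and activations cross the device boundary.
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Partition the model layer-wise across two devices.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Communicate intermediate activations from GPU 0 to GPU 1.
        return self.part2(x.to("cuda:1"))

model = TwoDeviceNet()
out = model(torch.randn(32, 1024))
# The loss is computed on the output device; during backpropagation,
# gradients flow back across the same device boundary automatically.
loss = out.sum()
loss.backward()
```

With this layer-wise decomposition, only one device is busy at a time during each forward and backward pass; more sophisticated schemes (e.g., pipelining of micro-batches) are designed to keep all devices working concurrently.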
As with any form of distributed computation, many options exist for precisely how one chooses to distribute both data and computational elements across resources. Those seeking to squeeze optimal performance out of such codes for very large and computationally demanding problems might choose to explore a variety of these options. Variants and combinations of data- and model-parallel training also exist, such as Fully Sharded Data Parallel (FSDP) training as recently supported within PyTorch.
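As a brief illustration of the FSDP variant just mentioned, the sketch below wraps a toy model in PyTorch's FullyShardedDataParallel so that parameters, gradients, and optimizer state are sharded across ranks. It assumes a reasonably recent PyTorch release and a launch with torchrun (one process per GPU); the model and optimizer settings are placeholders.

```python
# Minimal FSDP sketch: shard model state across ranks instead of replicating it.
# Assumes launch via torchrun so the process group can be initialized.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
# FSDP shards parameters, gradients, and optimizer state across ranks,
# gathering shards on demand, so each GPU holds only a fraction of the model.
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# ... the training loop then proceeds much as in ordinary data parallel training ...
```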
In the pages that follow, we will demonstrate the use of data parallel training within TensorFlow/Keras (using the tensorflow.distribute submodule), within PyTorch (using the torch.distributed submodule), and with Horovod, a framework for distributed training that works with multiple deep learning backends. For the sake of concreteness, we will consider the use of these only on the Frontera GPU nodes, and we will examine the use of Horovod only with the TensorFlow/Keras backend. But given the additional guidance in these pages, you should be able to customize your own workflows to other systems or backends as needed.