Distributed TensorFlow
Distributed training across multiple computational resources within TensorFlow/Keras is implemented through the tf.distribute module. TensorFlow/Keras provides support for different strategies, depending on how one wants to distribute the computation and on what resources it will be distributed over. In this example, we will consider the use of Keras to carry out synchronous, data-parallel training across multiple GPUs attached to a single CPU node, as described in the tutorial on Distributed training with Keras. We will run this code example on the GPU nodes on Frontera.
The MNIST problem is revisited in the Keras distributed training tutorial, with a Jupyter notebook providing the source code and commentary. For this particular code example, there is not a separate link provided to the raw Python source code, so you will want to download the notebook and extract the source code using the instructions that we have provided previously. (You can either view the Jupyter notebook in GitHub or download it directly from this link.) For the purposes of the discussion below, we save the Python source code from this notebook to a file named keras_distributed.py. We will not reproduce all the source code in this page, but will instead highlight some of the key features of the distributed multi-GPU code.
The key difference between this distributed TensorFlow/Keras code and the single-GPU code that we considered previously is that this code introduces a strategy for distributed computation:
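The strategy is created with a single call, as in the tutorial notebook. The sketch below also prints the number of replicas the strategy has detected, which mirrors what the notebook does; treat it as an outline rather than an exact copy of the notebook code.

```python
import tensorflow as tf

# Construct a MirroredStrategy; by default it claims all GPUs visible on the node.
strategy = tf.distribute.MirroredStrategy()

# One model replica per device; on a 4-GPU Frontera rtx node this reports 4.
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
```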
A MirroredStrategy() is one that replicates, or mirrors, the model across multiple GPUs on a single machine or CPU node, in order to implement a data-parallel training strategy. Each GPU contains a full copy of the model but processes only part of the data, and an all-reduce method is used to combine the gradients so that all the GPUs update the model parameters in a consistent manner. Other possible strategies are discussed in the Distributed training with TensorFlow guide.
Once a MirroredStrategy is constructed, there are downstream impacts on the rest of the code. The strategy is able to determine how many GPUs are available, and by default all of those GPUs are used for the computation. (One can choose to use only a subset, however, if that is desirable.) Because the training data will be split across GPUs, the batch size needs to be modified appropriately. In the code linked above, this modification makes use of information that the strategy holds about the number of model replicas, which is then followed by a suitable subdivision of the training and evaluation datasets:
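In outline, the batching code in the notebook looks like the following sketch. The names mnist_train, mnist_test, and the scale preprocessing function are defined earlier in the notebook; the specific values here are illustrative.

```python
# The global batch size scales with the number of model replicas.
BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Each GPU receives BATCH_SIZE_PER_REPLICA examples per training step.
train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
```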
The final important impact of introducing the distributed strategy is that the model definition gets embedded within a context (using the Python with keyword), which enables each of the replicas to hold a full copy of the model:
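Roughly, the scoped model definition in the notebook looks like this sketch; the exact layer stack and compile arguments may differ slightly in your copy of the code.

```python
with strategy.scope():
    # Any variables created here (layers, optimizer state) are mirrored across GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])
```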
Once this distributed strategy and model are set up, TensorFlow/Keras can take care of communicating information between the model replicas, and combining everything into a model ready for evaluation.
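Training and evaluation then use the ordinary Keras calls, with the distribution handled behind the scenes. A minimal sketch, assuming the datasets defined above (the notebook also attaches several callbacks, omitted here, and the number of epochs is arbitrary):

```python
EPOCHS = 12  # illustrative value; the notebook may use a different setting

model.fit(train_dataset, epochs=EPOCHS)

eval_loss, eval_acc = model.evaluate(eval_dataset)
print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
```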
We are going to reuse the tf282 virtual environment that we created previously, along with the necessary Lmod modules. The mechanics of running this code on Frontera are the same as what we used in TensorFlow on Frontera, and we will just use the interactive idev-based method here rather than the slurm sbatch method, although you can certainly consult the previous instructions if you prefer the latter.
The output from this run is reproduced at the bottom of this page (with some minor edits to adjust the formatting). We highlight a few points of interest. If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.
Using all 4 GPUs
As intended, this distributed training code is now using all 4 of the GPUs attached to the Frontera rtx node that we have been allocated. You might want to ssh into your rtx node in a second terminal, in order to monitor the GPU utilization with the nvidia-smi command.
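If you would rather check from within Python which GPUs TensorFlow can see, a quick query along these lines works in TF 2.x:

```python
import tensorflow as tf

# List the GPUs TensorFlow has detected on this node.
gpus = tf.config.list_physical_devices('GPU')
print('Visible GPUs:', len(gpus))
for gpu in gpus:
    print(' ', gpu.name)
```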
on_train_batch_end
You might see a warning like this:
It is not obvious what this means, but if you search the web for this type of warning, you will find that, for this problem at least, breaking the dataset into batches of this size results in a slowdown of the program rather than a speedup, due to the overhead of setting up the batches.
Suggested code modifications and experiments
Here are some suggestions about how you might explore this code example further:
- Modifying batch size: In the keras_distributed.py code, increase the batch size to see if (a) the code runs faster and (b) the warning about the on_train_batch_end callback method is eliminated.
- Try a different dataset, maybe with more data elements than the mnist problem. There is, for example, the emnist dataset, which is intended as a larger version of mnist-like data. If you decide to modify the keras_distributed.py code to handle another dataset, you might need to:
  - Reconfigure some of the data loading and splitting code, depending on how it is packaged
  - Modify the model definition to handle the number of class labels associated with the new dataset (currently hardwired at 10 for mnist). There might be an informational data structure linked to the dataset that you can query to get the number of class labels; see the sketch after this list.
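If you try the emnist suggestion and load the data through TensorFlow Datasets, a sketch of the second modification might look like the following. The 'emnist/balanced' variant and the split names are assumptions to verify against the TFDS catalog; other emnist variants have different numbers of classes.

```python
import tensorflow_datasets as tfds

# Load one of the emnist variants; 'emnist/balanced' is one choice among several.
datasets, info = tfds.load('emnist/balanced', with_info=True, as_supervised=True)
emnist_train, emnist_test = datasets['train'], datasets['test']

# Query the dataset metadata for the number of class labels, rather than hardwiring 10.
num_classes = info.features['label'].num_classes
print('Number of classes:', num_classes)

# The final Dense layer of the model would then become, e.g.:
#   tf.keras.layers.Dense(num_classes)
```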
Output from keras_distributed.py