DDP with PyTorch
Next, we will show how to convert a non-distributed training script into a distributed training script that uses PyTorch’s DDP. The major modifications needed to scale a PyTorch training script are as follows:
- Create a process group
- Use PyTorch’s DistributedSampler to ensure that data passed to each GPU is different
- Wrap Model with PyTorch’s DistributedDataParallel
- Modify the training loop to save the model from only one GPU
- Close process group
Next, let’s dive into each of the modifications above in more detail.
Create Process Group
The first step to scaling a PyTorch script is to set up a process group – that is, establishing the group of processes that will participate in training so that they can communicate with one another. This can be done via the torch.distributed.init_process_group method.
Below is an example of how to set up the process group locally. First, we set environment variables for the IP address of the rank 0 process and a free port. We then set the device with local_rank. Later in this tutorial we will give an example of how this can be set up for an HPC cluster. Note that we set the device prior to setting up the process group; this is important to prevent hangs or excessive memory utilization. As the final step, use the torch.distributed.init_process_group method to initialize the process group.
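A minimal sketch of this setup is shown below. The helper name, the localhost address, and the port are illustrative choices of ours; on a real cluster, MASTER_ADDR would be the address of the rank 0 node and MASTER_PORT any free port on it.

```python
import os

import torch
import torch.distributed as dist


def setup_process_group(rank: int, world_size: int) -> None:
    """Initialize the default process group for DDP (illustrative sketch)."""
    os.environ["MASTER_ADDR"] = "localhost"  # IP address of the rank 0 process
    os.environ["MASTER_PORT"] = "12355"      # any free port on that node
    # Set the device *before* initializing the process group to prevent
    # hangs or excessive memory utilization on GPU 0.
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
```

On a GPU node the nccl backend is used; the gloo fallback lets the same sketch run on CPU-only machines.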
Create Data DistributedSampler
The purpose of the DistributedSampler is to distribute data among the various processes. It ensures that the batch each GPU receives is different, providing an iterator that dispatches distinct subsets of the data to each process.
In the code cell below, we use the DataLoader as we saw in the previous notebook, but we pass the DistributedSampler via the sampler argument. We also change shuffle from True to False, since the sampler handles shuffling. Note that in the example below each GPU receives a batch of 32 data samples, so the effective global batch size is number_gpu * 32.
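The described DataLoader setup can be sketched as follows; the function name and arguments are our own, and the sampler is told explicitly which rank it serves:

```python
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


def build_dataloader(dataset: Dataset, rank: int, world_size: int,
                     batch_size: int = 32) -> DataLoader:
    """Shard a dataset across processes with DistributedSampler (sketch)."""
    # The sampler partitions the dataset so each rank sees a disjoint subset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    # shuffle=False here: the sampler itself shuffles (per epoch, via set_epoch).
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=False, sampler=sampler)
```

With batch_size=32 per process and N processes, one optimizer step sees 32 * N samples in total.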
Wrap Model with DistributedDataParallel
In order to use PyTorch’s Distributed Data Parallel we need to wrap our model (e.g., ResNet18) with the DDP wrapper. In the function below we combine instantiating our process group, setting up the distributed sampler, and wrapping our model with DDP into one function. This will be utilized in a later lab, Distributed Code for Multiple GPUs on One Node.
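A combined setup function might look like the sketch below. The name ddp_setup, the address/port values, and the CPU fallback are our assumptions; any model works in place of ResNet18.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def ddp_setup(rank, world_size, model, dataset, batch_size=32):
    """Create the process group, the distributed sampler, and the DDP model."""
    os.environ.setdefault("MASTER_ADDR", "localhost")  # illustrative
    os.environ.setdefault("MASTER_PORT", "12355")      # illustrative
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)   # set device before init, to avoid hangs
        backend, device_ids = "nccl", [rank]
        model = model.cuda(rank)
    else:
        backend, device_ids = "gloo", None  # CPU fallback for illustration
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size,
                        shuffle=False, sampler=sampler)
    ddp_model = DDP(model, device_ids=device_ids)
    return ddp_model, loader
```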
Note that when we wrap our model with DDP, we also need to modify any other code that accesses the state of our model. Previously, we accessed a model’s .state_dict() by calling that method directly on the model. Once the model has been wrapped with DDP, the underlying model must be reached through the wrapper’s .module attribute instead.
Train Model
When saving our model’s checkpoints throughout the training process with DDP, by default a copy of the model is saved from every GPU. This is unnecessarily redundant, since after each optimizer step the model replicas are identical. To fix this we can modify our train function to save the model’s state_dict from only one of our GPUs. The pseudocode below highlights this modification.
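A sketch of such a train function is shown below (function and argument names are ours). The key lines are the set_epoch call, which re-shuffles each sampler’s partition every epoch, and the rank check guarding the checkpoint write:

```python
import torch


def train(ddp_model, loader, optimizer, num_epochs, rank,
          path="checkpoint.pt"):
    """Train a DDP-wrapped model, checkpointing from rank 0 only (sketch)."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        # Re-shuffle the DistributedSampler's partition for this epoch.
        loader.sampler.set_epoch(epoch)
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), labels)
            loss.backward()   # DDP synchronizes gradients across ranks here
            optimizer.step()
        # Save from rank 0 only -- every rank holds identical weights.
        if rank == 0:
            torch.save(ddp_model.module.state_dict(), path)
```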
Clean Up Process Group
Once we have trained our model we can destroy the process group using the function below. This will also be utilized in a later lab, Distributed Code for Multiple GPUs on One Node.
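The cleanup step is a one-line call to torch.distributed.destroy_process_group; the wrapper function name is our own:

```python
import torch.distributed as dist


def cleanup() -> None:
    # Tear down the default process group and release its
    # communication resources.
    dist.destroy_process_group()
```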
Summary
In the above text we highlighted the major modifications that are needed to scale a PyTorch script to multiple GPUs. Next, we will code a simple neural network training script and modify it to use DDP.
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)