Distributed PyTorch
Distributed training within PyTorch is implemented through the torch.distributed
module. An overview of the different capabilities supported by that module is provided in the PyTorch Distributed Overview. While different distributed computing architectures are supported in PyTorch, we will focus here on
Distributed Data Parallel (DDP) training. The DDP submodule offers a higher-level interface to support operations commonly required for distributed data parallel training, and it sits on top of lower-level libraries that provide support for interprocess communication (e.g., point-to-point and collective communications). If you intend to develop a distributed training application using PyTorch, we recommend reading through the Overview to understand your possible options and what specific components might be best suited for your particular application.
The PyTorch documentation contains a detailed tutorial on Getting Started with Distributed Data Parallel. We will examine some aspects of it, in particular the code in the sections on the Basic Use Case, Save and Load Checkpoints, and Combining DDP with Model Parallelism. The tutorial provides code for each of these sections that you can copy and paste into a single Python source file to run on Frontera. While we will highlight a few aspects of this code and its use of the DDP framework, readers are encouraged to read through the entire tutorial to get a fuller sense of the issues involved in using DDP for distributed training.
One key difference between this DDP code and the single-GPU PyTorch code that we considered previously is that this code introduces a process group for distributed computation:
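A minimal sketch of the tutorial's process-group setup and teardown looks like the following; the master address and port values are illustrative choices for a run confined to a single node:

    import os
    import torch.distributed as dist

    def setup(rank, world_size):
        # Address and port of the rank-0 process; reasonable defaults
        # for a single-node run.
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        # Every rank joins this process group before training begins.
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

    def cleanup():
        # Tear down the process group once training is finished.
        dist.destroy_process_group()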
In this case, the process group is configured to use the gloo backend for interprocess communication, but other backends are also available. It is also configured with the parameter world_size, which reflects the number of independent processes to run, i.e., the number of GPUs we want to run on. In the main() function at the end, this world_size is computed by querying PyTorch for the number of available CUDA devices:
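In outline, that query is a one-liner (a minimal sketch):

    import torch

    # Run one process per GPU visible on the node.
    world_size = torch.cuda.device_count()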
The DDP wrapper class itself is imported as follows:
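    # DistributedDataParallel lives in torch.nn.parallel
    from torch.nn.parallel import DistributedDataParallel as DDP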
Once a neural network model has been constructed using the torch.nn API, it is wrapped with DDP so that it can run concurrently on multiple GPUs, each identified by a distinct rank:
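A sketch of the wrapping step, as it appears inside a per-rank demo function (ToyModel here stands in for whatever torch.nn model you have defined):

    # Each rank builds its own copy of the model on its own GPU and wraps it
    # with DDP, which keeps gradients synchronized across ranks.
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])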
PyTorch contains a multiprocessing
module, which is a wrapper around the multiprocessing module found in the Python Standard Library. The PyTorch multiprocessing API is completely compatible with the original API, adding some functionality underneath to provide shared views of the same data in different processes. In the example code we are considering, the multiprocessing (mp) spawn function is used to run a specified function (demo_fn) in separate processes:
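A minimal sketch of that launcher, following the structure in the tutorial:

    import torch.multiprocessing as mp

    def run_demo(demo_fn, world_size):
        # Launch world_size processes; spawn passes each process its rank
        # as the first argument, followed by the contents of args.
        mp.spawn(demo_fn,
                 args=(world_size,),
                 nprocs=world_size,
                 join=True)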
In order to run the code contained within the tutorial, we will reuse the pytorch virtual environment that we created previously. The mechanics of running this code on Frontera are the same as what we used in PyTorch on Frontera, and we will just use the interactive idev-based method here rather than the slurm sbatch method, although you can certainly consult the previous instructions if you prefer the latter.
The aggregated code from the use cases linked above contains a main code block that runs three different demos using the run_demo function:
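That block is roughly of the following form; the demo function names come from the tutorial, and the halved process count for the model-parallel demo reflects that each of its replicas spans two GPUs:

    if __name__ == "__main__":
        n_gpus = torch.cuda.device_count()
        assert n_gpus >= 2, f"Requires at least 2 GPUs, but got {n_gpus}"
        world_size = n_gpus
        run_demo(demo_basic, world_size)
        run_demo(demo_checkpoint, world_size)
        # Each model-parallel replica occupies two GPUs, so run half as many processes.
        run_demo(demo_model_parallel, world_size // 2)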
Once you've assembled all the code into a source file with a name like torchDDPUseCase.py
and activated your virtual environment, you can use idev
to get an interactive node, and then run:
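For example (assuming the source file name above and that python3 resolves to the interpreter in your activated virtual environment):

    python3 torchDDPUseCase.py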
The output that is printed by these demos is not very exciting, but you can certainly go in and modify the code if you're interested in printing out other variables in the program. And as mentioned previously, you can also start up a second terminal, ssh into your interactive node on Frontera, and run nvidia-smi -l 5
to see that all 4 GPUs are being engaged.