Reused Code from Parts 1 & 2
PyTorch aims to make distributed training non-intrusive: you can take existing code that trains a neural network on a single processor and, with minimal code changes, scale out when you hit resource limitations. In that spirit, let's first point out all of the code that can be reused. Below are the functions and import statements, created previously, that we will not need to modify.
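As a rough illustration of the kind of code that carries over unchanged, here is a minimal sketch. The specific model, transforms, and helper names below are assumptions for illustration, not the exact code from the earlier parts of this tutorial.

```python
# Sketch of code that typically carries over unchanged from Parts 1 & 2.
# The ResNet model and the helper names here are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms


def build_model(num_classes: int) -> nn.Module:
    """Create the classifier; unchanged when moving to multiple GPUs."""
    model = torchvision.models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


def build_transforms() -> transforms.Compose:
    """Image preprocessing pipeline; unchanged when moving to multiple GPUs."""
    return transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
```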
Next, let's walk through the modifications needed to scale our DesignSafe Classifier code, in the following order. As a reminder, below are the eight modifications highlighted in this and the previous tutorial; a short code sketch follows each list.
Serial to Parallel Modifications
- Create a process group
- Use PyTorch’s DistributedSampler to ensure that data passed to each GPU is different
- Wrap Model with PyTorch’s DistributedDataParallel
- Modify Training Loop to write model from one GPU
- Close process group
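Taken together, these five changes look roughly like the sketch below. This is a minimal illustration rather than the tutorial's actual training script: the toy dataset, toy model, and the assumption that MASTER_ADDR/MASTER_PORT are set by the launcher are placeholders.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def train(rank: int, world_size: int, epochs: int = 2):
    # 1. Create the process group (NCCL backend for GPU training).
    #    MASTER_ADDR / MASTER_PORT are assumed to be set by the launcher.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy data and model stand in for the DesignSafe dataset and classifier.
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    model = nn.Linear(10, 1).to(rank)

    # 2. DistributedSampler hands each GPU a different shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # 3. Wrap the model with DistributedDataParallel.
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)          # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss_fn(ddp_model(x), y).backward()
            optimizer.step()

        # 4. Save the model from one process only (rank 0).
        if rank == 0:
            torch.save(ddp_model.module.state_dict(), "model.pt")

    # 5. Close the process group.
    dist.destroy_process_group()
```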
Torchrun Modifications
- Modify code for the environment variables set by torchrun:
  - Remove the code that sets these environment variables, since torchrun sets them for you automatically
  - Instead, read the environment variables provided by torchrun rather than defining them explicitly
- Add code for writing checkpoints and resuming training after a failure:
  - Create a location to store checkpoints
  - Read an existing checkpoint, if there is one, and resume training at the epoch when it was written
  - Write checkpoints periodically during training
- Remove the mp.spawn call used to parallelize the code and replace it with an ordinary function call, since torchrun launches the processes for you
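The sketch below combines these torchrun-related changes: reading LOCAL_RANK (and, implicitly, RANK and WORLD_SIZE) from the environment, resuming from a checkpoint if one exists, writing checkpoints periodically, and calling the training function directly instead of through mp.spawn. The snapshot path, the toy model, and the save interval are placeholder assumptions, not the tutorial's exact code.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

SNAPSHOT_PATH = "snapshot.pt"   # placeholder location for checkpoints


def main(total_epochs: int = 10, save_every: int = 2):
    # Use the environment variables torchrun sets instead of defining them ourselves.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")          # rank and world size come from the environment
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 1).to(local_rank)  # toy model stands in for the classifier
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # Resume from a checkpoint if one exists, picking up at the epoch after it was written.
    start_epoch = 0
    if os.path.exists(SNAPSHOT_PATH):
        snapshot = torch.load(SNAPSHOT_PATH, map_location=f"cuda:{local_rank}")
        ddp_model.module.load_state_dict(snapshot["model_state"])
        start_epoch = snapshot["epoch"] + 1

    for epoch in range(start_epoch, total_epochs):
        # ... one epoch of training, as in the DDP sketch above ...

        # Write a checkpoint periodically, from rank 0 only.
        if epoch % save_every == 0 and dist.get_rank() == 0:
            torch.save({"model_state": ddp_model.module.state_dict(),
                        "epoch": epoch}, SNAPSHOT_PATH)

    dist.destroy_process_group()


if __name__ == "__main__":
    # No mp.spawn: torchrun starts one process per GPU and each simply calls main(),
    # e.g.  torchrun --standalone --nproc_per_node=4 train.py
    main()
```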