To utilize torchrun’s functionality, we will need to make some changes to the distributed scaling script we created in the previous notebook. These code changes include:

  1. Modify code for environment variables set by torchrun
    1. Remove code that sets environment variables, as this is done for you automatically by torchrun.
    2. Use the environment variables set by torchrun instead of explicitly defining them.
  2. Add code for writing checkpoints and resuming training after failure.
    1. Create location to store checkpoints
    2. Read checkpoints if they exist and resume training at epoch checkpoint was written
    3. Write checkpoints periodically during training
  3. Remove the mp.spawn call used to parallelize the code and replace it with a plain function call, since launching one process per GPU is done automatically by torchrun (see the sketch after this list)
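
As a rough sketch of the third change, the launch code at the bottom of the script shrinks to a direct call. The main function below is a hypothetical stand-in for the real training entry point, not the exact code from the lab:

import os

def main():
    # Hypothetical stand-in for the real training entry point.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(f"running as rank {rank} of {world_size}")

if __name__ == "__main__":
    # Previously: mp.spawn(main, args=(world_size,), nprocs=world_size)
    # torchrun launches one process per GPU for us, so a direct call suffices.
    main()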

Let’s walk through the code changes listed above by revisiting the MNIST example we used in the previous notebook. To implement these changes, only two functions need to be modified: init_distributed and main.

1. Modify code for environment variables set by torchrun

To use the environment variables set by torchrun, we need to modify both the init_distributed and main functions, as highlighted in the first cell. In summary, we remove the local_rank and world_size arguments from init_distributed and instead read these values inside the function from the environment variables torchrun sets. Additionally, we modify main to use the LOCAL_RANK environment variable to select the device where the model is placed, and to call the modified init_distributed function.
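
A minimal sketch of the idea, assuming an NCCL backend and one GPU per process (the training details are omitted; the exact code lives in the lab script):

import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE (plus MASTER_ADDR and
    # MASTER_PORT) for every process it launches, so we read them here
    # instead of defining them ourselves.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank

def main():
    local_rank = init_distributed()
    device = torch.device(f"cuda:{local_rank}")
    # ... build the model, move it to device, wrap it in DDP, and train ...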

2. Add code for writing checkpoints and resuming training after failure

We need to make several modifications to the main function to incorporate writing checkpoints and resuming from a checkpoint after a process failure. These modifications are highlighted below with rows of # and include line-by-line comments explaining why each modification was made.
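
As a rough sketch of the pattern (the checkpoint path, snapshot contents, and helper names below are illustrative, not the exact code from the lab script):

import os
import torch

CHECKPOINT_PATH = "checkpoints/snapshot.pt"  # illustrative location

def save_checkpoint(model, optimizer, epoch):
    # Only rank 0 writes, so processes don't clobber each other's files.
    if int(os.environ["RANK"]) == 0:
        os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
        torch.save(
            {
                "epoch": epoch,
                # For a DDP-wrapped model, save the underlying module's weights.
                "model_state": model.module.state_dict(),
                "optimizer_state": optimizer.state_dict(),
            },
            CHECKPOINT_PATH,
        )

def load_checkpoint(model, optimizer, device):
    # If a checkpoint exists, resume at the epoch after it was written.
    start_epoch = 0
    if os.path.exists(CHECKPOINT_PATH):
        snapshot = torch.load(CHECKPOINT_PATH, map_location=device)
        model.module.load_state_dict(snapshot["model_state"])
        optimizer.load_state_dict(snapshot["optimizer_state"])
        start_epoch = snapshot["epoch"] + 1
    return start_epoch

# Inside main(): resume if possible, then checkpoint periodically, e.g.
#   start_epoch = load_checkpoint(model, optimizer, device)
#   for epoch in range(start_epoch, num_epochs):
#       train_one_epoch(...)
#       if epoch % save_every == 0:
#           save_checkpoint(model, optimizer, epoch)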

You can find the entire modified script with the changes highlighted above in the file mnist_torchrun.py generated in the lab. Next, we will learn how to run this script with torchrun.

 