PyTorch provides a tool called torchrun that, when combined with checkpointing, automatically handles fault tolerance. Specifically, torchrun offers the following functionality:

  • Worker failures are handled gracefully: torchrun restarts your workers, and your script resumes from the previously saved checkpoint
  • Environment variables, such as RANK and WORLD_SIZE, are set automatically for you. All environment variables set by PyTorch can be found here
  • The number of nodes being leveraged can vary during training (elasticity)
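As a sketch of the second point, each worker can read these variables straight from its environment. RANK, LOCAL_RANK, and WORLD_SIZE are among the variables torchrun sets; the single-process default values below are an assumption added so the snippet also runs outside torchrun:

```python
import os

# torchrun sets these for every worker process it launches.
# The defaults are assumptions that let the script also run standalone
# (e.g. `python train.py`) as a single process.
rank = int(os.environ.get("RANK", 0))              # global rank across all nodes
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # rank within this node (often the GPU index)
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of workers

print(f"worker {rank}/{world_size} (local rank {local_rank})")
```

Because torchrun populates these variables itself, the same script works unchanged whether it is launched with one worker or many.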

In this notebook we will introduce how to use the environment variables that torchrun sets automatically, as well as how to use checkpointing. We will not cover elasticity, as it is outside the scope of this course. To explain the functionality of torchrun, we will:

  1. Cover the code modifications needed, using the MNIST example from the previous notebook.
  2. Explain how to launch your script.

Let’s get started.
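Before looking at the MNIST modifications, the checkpointing pattern that fault-tolerant restarts rely on can be sketched as follows. This is a minimal illustration rather than the course's actual code: the file name `snapshot.pt`, the helper names, and the dictionary keys are all assumptions. The key idea is that on every (re)start the script first checks for a snapshot and resumes from it, so workers restarted by torchrun pick up at the last saved epoch instead of starting over:

```python
import os
import torch
import torch.nn as nn

SNAPSHOT_PATH = "snapshot.pt"  # assumed checkpoint file name

def save_snapshot(model: nn.Module, epoch: int) -> None:
    # Save everything needed to resume: model weights and the epochs completed.
    snapshot = {"MODEL_STATE": model.state_dict(), "EPOCHS_RUN": epoch}
    torch.save(snapshot, SNAPSHOT_PATH)

def load_snapshot(model: nn.Module) -> int:
    # Restore the weights and return the number of epochs already run.
    snapshot = torch.load(SNAPSHOT_PATH, map_location="cpu")
    model.load_state_dict(snapshot["MODEL_STATE"])
    return snapshot["EPOCHS_RUN"]

model = nn.Linear(4, 2)  # stand-in for a real model
total_epochs = 3
start_epoch = 0
if os.path.exists(SNAPSHOT_PATH):
    # A snapshot exists: we were restarted after a failure, so resume from it.
    start_epoch = load_snapshot(model)

for epoch in range(start_epoch, total_epochs):
    # ... one training epoch would run here ...
    save_snapshot(model, epoch + 1)
```

In a real multi-worker run you would typically save the snapshot from rank 0 only, and include the optimizer state alongside the model weights.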

© Chishiki-AI  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)