Leveraging multiple GPUs to train neural networks comes with many challenges. One of the biggest is coping with GPU failures, which can happen for many reasons (overheating, aging hardware wearing out, viruses, etc.). Fault tolerance is the ability of a system to maintain operation even when one of its components fails. One way to combat component failure is checkpointing: we periodically save the state of our application (in deep learning, the current weights of our model) so that, if a process fails, we can resume the application from the most recent checkpoint (see Figure 2).

[Figure 2. Visual of checkpointing. CP refers to a point in time when a checkpoint is saved.]
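To make the idea concrete, below is a minimal single-process sketch of checkpointing in PyTorch. The file name `checkpoint.pt`, the toy model, and the helper functions `save_checkpoint` and `load_checkpoint` are illustrative assumptions, not part of any particular training script; in a multi-GPU job, typically only one rank writes the checkpoint.

```python
import os
import torch
import torch.nn as nn

CHECKPOINT_PATH = "checkpoint.pt"  # hypothetical path for this sketch


def save_checkpoint(model, optimizer, epoch, path=CHECKPOINT_PATH):
    """Save the current training state so a failed run can resume from here."""
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(model, optimizer, path=CHECKPOINT_PATH):
    """Restore training state from the latest checkpoint, if one exists."""
    if not os.path.exists(path):
        return 0  # no checkpoint yet: start from the first epoch
    state = torch.load(path)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"] + 1  # resume from the epoch after the saved one


# Minimal training loop that saves a checkpoint ("CP" in Figure 2) each epoch.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = load_checkpoint(model, optimizer)

for epoch in range(start_epoch, 5):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in training batch
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    save_checkpoint(model, optimizer, epoch)
```

If the process dies partway through, rerunning the same script picks up at `start_epoch` instead of epoch 0, which is exactly the resume-from-CP behavior sketched in Figure 2.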

Next, we will talk about torchrun, PyTorch's tool for launching distributed training, which provides fault tolerance by automatically restarting failed workers so they can resume from your checkpoints.
