Single Node Multi-GPU Training with torchrun
Krishna Kumar
The University of Texas at Austin, Chishiki-AI
5/2024 (original)
In the previous notebook we introduced parallel computing at HPC centers and DDP, PyTorch's algorithm for distributed training. However, many challenges can arise when utilizing multiple computational resources; one of the biggest is handling the case where a resource fails partway through training. In this notebook we discuss these issues and show how to set up our parallel implementation so that it can continue to run despite intermittent computational resources. We then combine the material from this tutorial and the previous one and apply it to the DesignSafe classifier we used previously.

Specifically, in this tutorial, we will cover the following material:
- Introduce fault tolerance
- Introduce PyTorch’s torchrun
- Go over code modifications needed to use torchrun to launch distributed code that is fault tolerant
- Implement a script for training the DesignSafe classifier using torchrun
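Before diving in, it helps to see how torchrun differs from manually spawned DDP processes: torchrun launches one worker process per GPU and hands each worker its identity through environment variables, so the training script no longer spawns processes itself. The sketch below (not code from this tutorial; the function name is our own) reads those variables, with single-process fallbacks so the same script also runs without torchrun.

```python
import os

def get_dist_env():
    """Read the environment variables torchrun sets for each worker.

    torchrun injects RANK, LOCAL_RANK, and WORLD_SIZE into every worker
    process it launches. The defaults below let the script also run as a
    plain single process (e.g. for debugging) without torchrun.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global worker index
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total number of workers
    }

if __name__ == "__main__":
    print(get_dist_env())
```

On a single node with four GPUs, such a script would typically be launched with `torchrun --standalone --nproc_per_node=4 train.py`; each of the four workers then sees its own `LOCAL_RANK` and selects the matching GPU.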
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)