Single Node Multi-GPU Training with torchrun
Krishna Kumar
The University of Texas at Austin, Chishiki-AI
5/2024 (original)
In the previous notebook we introduced parallel computing at HPC centers and DDP, PyTorch's algorithm for distributed training. However, many challenges can arise when utilizing multiple computational resources; one of the biggest is handling the case where a resource fails partway through training. In this notebook we discuss these issues and show how to set up our parallel implementation so that it can continue to run despite intermittent computational resources. We then combine the material from this tutorial and the previous one and apply it to the DesignSafe classifier we used previously.

Specifically, in this tutorial, we will cover the following material:
- Introduce fault tolerance
- Introduce PyTorch’s torchrun
- Go over code modifications needed to use torchrun to launch distributed code that is fault tolerant
- Implement a script for training the DesignSafe classifier using torchrun
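Before diving in, it helps to see how torchrun differs from manually spawned DDP processes: torchrun launches one worker process per GPU and hands each worker its identity through environment variables, so the training script no longer spawns processes itself. The sketch below (not code from this tutorial; the function name is our own) reads those variables, with single-process fallbacks so the same script also runs without torchrun.

```python
import os

def get_dist_env():
    """Read the environment variables torchrun sets for each worker.

    torchrun injects RANK, LOCAL_RANK, and WORLD_SIZE into every worker
    process it launches. The defaults below let the script also run as a
    plain single process (e.g. for debugging) without torchrun.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global worker index
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total number of workers
    }

if __name__ == "__main__":
    print(get_dist_env())
```

On a single node with four GPUs, such a script would typically be launched with `torchrun --standalone --nproc_per_node=4 train.py`; each of the four workers then sees its own `LOCAL_RANK` and selects the matching GPU.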
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)