Multi-node Distributed Training with PyTorch

Krishna Kumar
The University of Texas at Austin, Chishiki-AI

5/2024 (original)

In the previous parts, we learned that training deep learning models on multiple GPUs becomes critical as the number of parameters in neural networks grows. The same logic of adding compute resources to speed up training motivates us to scale further, beyond the resources available on a single node or machine.

Before we learn about multi-node training, let's first review some key concepts from the previous part of the tutorial.
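As a brief refresher, the sketch below shows the single-node, multi-GPU DistributedDataParallel (DDP) pattern that the earlier parts of this tutorial build on. It is a minimal illustration, not code from the previous part: it assumes the script is launched with torchrun (e.g. `torchrun --nproc_per_node=4 ddp_recap.py`) and uses a toy linear model and random data as placeholders.

```python
# Minimal single-node DDP recap (assumes launch via torchrun, which sets
# RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Join the process group; NCCL is the standard backend for GPU training.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each process owns one GPU and holds a full replica of the model.
    model = nn.Linear(10, 1).to(local_rank)          # toy placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    # One toy training step: DDP averages gradients across all GPUs
    # during backward(), so every replica stays in sync.
    inputs = torch.randn(32, 10, device=local_rank)   # random placeholder data
    targets = torch.randn(32, 1, device=local_rank)
    loss = loss_fn(ddp_model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Multi-node training extends this same pattern: the process group simply spans processes on several machines instead of one, which is the subject of the rest of this part.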
