Summary
In this notebook, we introduced multi-node distributed training with PyTorch. We used a message-exchanging example to show how to write a distributed PyTorch application that runs across multiple nodes, launching it with the torchrun launcher and setting the required environment variables via mpirun and Slurm. Building on that knowledge of torchrun, MPI, and Slurm, we then walked through distributed training of a machine learning model across multiple nodes in the DesignSafe ResNet case study: we covered the major code modifications needed for ResNet training, discussed data preparation on multiple nodes, and finally launched the training script in a containerized environment.
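As a quick recap, here is a minimal sketch of the kind of message-exchanging program covered in this notebook. The script name `send_recv.py` and the head-node hostname are placeholders, and the exact example used earlier in the notebook may differ in its details.

```python
# Launch on each node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=4 \
#       --rdzv_backend=c10d --rdzv_endpoint=$HEAD_NODE:29500 send_recv.py
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, etc. in the environment,
    # so init_process_group can read them via the default env:// method.
    dist.init_process_group(backend="gloo")  # use "nccl" for GPU training
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    tensor = torch.zeros(1)
    if rank == 0:
        # Rank 0 sends a distinct message to every other rank.
        for dst in range(1, world_size):
            dist.send(torch.tensor([float(dst)]), dst=dst)
    else:
        dist.recv(tensor, src=0)
        print(f"Rank {rank}/{world_size} received {tensor.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```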
© Chishiki-AI | Cornell University | Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)