Notebook Outline
For this notebook, start a session with 2 Nodes and 2 Tasks.
This tutorial approaches multi-node distributed training with PyTorch in the following steps:
- Message exchanging example with PyTorch DDP
  - torchrun
  - Set environment variable with mpirun and Slurm
- DesignSafe ResNet Case Study
  - Major code modifications for ResNet Training
  - Data preparation
  - Launch training script in a containerized environment
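
As a preview of the launcher-related steps in the outline, here is a minimal sketch of how a script might work out its rank and world size under different launchers. It is not the tutorial's own code. The helper name `resolve_dist_env` and the `LAUNCHER_VARS` table are hypothetical. The variable names are the ones conventionally exported by torchrun, by Open MPI's mpirun, and by Slurm's srun, and the sketch assumes Open MPI is the MPI implementation in use.

```python
import os

# Hypothetical lookup table: (rank, world size, local rank) variable names
# conventionally set by each launcher. Open MPI is assumed for mpirun.
LAUNCHER_VARS = {
    "torchrun": ("RANK", "WORLD_SIZE", "LOCAL_RANK"),
    "openmpi": (
        "OMPI_COMM_WORLD_RANK",
        "OMPI_COMM_WORLD_SIZE",
        "OMPI_COMM_WORLD_LOCAL_RANK",
    ),
    "slurm": ("SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID"),
}


def resolve_dist_env(env=None):
    """Return launcher name, rank, world size, and local rank.

    Launchers are checked in the order listed in LAUNCHER_VARS; the first
    one whose three variables are all present wins. If none match, fall
    back to a single-process configuration.
    """
    env = os.environ if env is None else env
    for launcher, names in LAUNCHER_VARS.items():
        if all(name in env for name in names):
            rank, world_size, local_rank = (int(env[name]) for name in names)
            return {
                "launcher": launcher,
                "rank": rank,
                "world_size": world_size,
                "local_rank": local_rank,
            }
    return {"launcher": "none", "rank": 0, "world_size": 1, "local_rank": 0}


if __name__ == "__main__":
    # With 2 nodes and 2 tasks, each process would report world_size == 2.
    print(resolve_dist_env())
```

In a real script, the resolved values would typically be passed to `torch.distributed.init_process_group`, which with the default environment-based initialization also expects `MASTER_ADDR` and `MASTER_PORT` to be set.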
© Chishiki-AI | Cornell University | Center for Advanced Computing | Copyright Statement | Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)