For this notebook, start a session with 2 Nodes and 2 Tasks.

This tutorial covers multi-node distributed training with PyTorch in the following steps:

  • Message-exchange example with PyTorch DDP
    • torchrun
    • Setting environment variables with mpirun and Slurm
  • DesignSafe ResNet Case Study
    • Major code modifications for ResNet Training
    • Data preparation
    • Launch training script in a containerized environment
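
As a preview of the environment-variable step in the outline above, the following is a minimal sketch of how a training script might work out its rank and world size regardless of which launcher started it. The helper name `resolve_dist_env` is hypothetical and is not part of PyTorch. The variable names are the ones commonly set by torchrun (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`), Open MPI's mpirun (`OMPI_COMM_WORLD_*`), and Slurm's srun (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`); other MPI implementations use different names, so treat the mapping as an assumption to check on your system.

```python
import os

# Assumed variable names per launcher, checked in this order.
# (rank, world size, local rank on the node)
_LAUNCHER_VARS = {
    "torchrun": ("RANK", "WORLD_SIZE", "LOCAL_RANK"),
    "openmpi": (
        "OMPI_COMM_WORLD_RANK",
        "OMPI_COMM_WORLD_SIZE",
        "OMPI_COMM_WORLD_LOCAL_RANK",
    ),
    "slurm": ("SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID"),
}


def resolve_dist_env(environ=None):
    """Hypothetical helper: return rank info from whichever launcher is detected.

    `environ` defaults to os.environ; passing a plain dict makes it easy to test.
    """
    env = os.environ if environ is None else environ
    for launcher, (rank_key, size_key, local_key) in _LAUNCHER_VARS.items():
        if rank_key in env and size_key in env:
            return {
                "launcher": launcher,
                "rank": int(env[rank_key]),
                "world_size": int(env[size_key]),
                # Fall back to 0 if the launcher did not export a local rank.
                "local_rank": int(env.get(local_key, 0)),
            }
    # No launcher variables found: behave like a single-process run.
    return {"launcher": "single", "rank": 0, "world_size": 1, "local_rank": 0}


if __name__ == "__main__":
    print(resolve_dist_env())
```

With the 2-node, 2-task session used in this notebook, each of the two processes would be expected to report a world size of 2 and a rank of 0 or 1. A script can then copy these values into `RANK` and `WORLD_SIZE` before initializing the process group, which is the approach the mpirun and Slurm step explores.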
 
© Chishiki-AI | Cornell University | Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)