Summary

In this notebook, we covered the basics of how Distributed Data Parallel (DDP) works, highlighted the major code modifications needed to convert a non-distributed training script into a distributed one, and applied those modifications to the DesignSafe Image Classifier example. In the next section, we will discuss how to launch this script on a single node and on multiple nodes of an HPC system.
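As a reminder of the core idea behind DDP, each process trains on its own slice of the dataset. The sketch below illustrates that partitioning in plain Python. It assumes the `RANK` and `WORLD_SIZE` environment variables that torchrun sets for each process, and it mimics the strided, padded split that `torch.utils.data.DistributedSampler` uses when shuffling is off. The helper name `shard_indices` is hypothetical and is not taken from the notebook.

```python
import math
import os


def shard_indices(dataset_len: int, rank: int, world_size: int) -> list[int]:
    """Return the dataset indices one process would handle (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    if dataset_len == 0:
        return []
    # Every rank gets the same number of samples, so all processes
    # run the same number of steps per epoch.
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad to `total` by wrapping around to the start of the dataset.
    padded = [i % dataset_len for i in range(total)]
    # Strided split: rank r takes positions r, r + world_size, ...
    return padded[rank:total:world_size]


if __name__ == "__main__":
    # torchrun exports these variables; the defaults allow a plain
    # single-process run for experimentation.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(f"rank {rank} of {world_size}: {shard_indices(10, rank, world_size)}")
```

With 10 samples and 4 processes, each rank receives 3 indices, and the last two ranks reuse indices 0 and 1 as padding. In a real training script the sampler does this work for you; the sketch only makes the partitioning visible.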

© Chishiki-AI | Cornell University | Center for Advanced Computing | Copyright Statement | Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)