
Lab: CNN with Multi-Node Multi-GPU

Instructions:

Start a new TAP session with 2 Nodes and 2 Tasks, rather than the 1 Node and 1 Task described in Lab Setup
If you have not already done so, download the notebooks as instructed in Lab Setup
Navigate to the sciml-course/cnn directory and open cnn_part5.ipynb
Do not change the kernel for this lab

Tips:

If you encounter an mpirun error while running the notebook, double-check that you started the TAP session with 2 Nodes and 2 Tasks and that you are using the default kernel.
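
One way to check the allocation from inside the notebook is sketched below. It assumes the TAP session runs inside a Slurm job that exports the standard SLURM_NNODES and SLURM_NTASKS environment variables; that is an assumption about the platform, not something Lab Setup states. The helper name session_matches is hypothetical.

```python
import os


def session_matches(env, nodes=2, tasks=2):
    """Return True if the Slurm variables in `env` report the expected allocation.

    Assumption: the session is a Slurm job, so SLURM_NNODES and SLURM_NTASKS
    are set. If either is missing or not an integer, this returns False.
    """
    try:
        return (int(env["SLURM_NNODES"]) == nodes
                and int(env["SLURM_NTASKS"]) == tasks)
    except (KeyError, ValueError):
        return False


# Check the current process environment against the 2 Nodes / 2 Tasks setup.
print("Session has 2 nodes and 2 tasks:", session_matches(os.environ))
```

If this prints False, the session may have been started with the default 1 Node and 1 Task, or the variables may not exist on your system.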

This notebook uses the same hyperparameters as part 1:

Learning Rate (lr): step size that controls how much the model parameters change at each update
Batch Size: number of data points used to estimate the gradients at each iteration
Epochs: number of complete passes over the entire dataset during optimization

These hyperparameters will be used throughout the notebook.
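
To illustrate how these three hyperparameters interact once training is split across processes, here is a minimal sketch. The numeric values, the dataset size, and the helper names are hypothetical and are not taken from the notebook. It also assumes the batch size applies per process and that the dataset is divided evenly across processes with padding, which is how PyTorch's DistributedSampler behaves by default.

```python
import math

# Hypothetical values for illustration only; the notebook defines its own.
lr = 0.001           # learning rate
batch_size = 32      # samples per iteration, per process
epochs = 5           # complete passes over the dataset
dataset_size = 10000
world_size = 2       # number of processes (2 tasks in this lab)


def steps_per_epoch(dataset_size, batch_size, world_size=1):
    """Iterations each process runs per epoch when the data is sharded evenly."""
    samples_per_rank = math.ceil(dataset_size / world_size)
    return math.ceil(samples_per_rank / batch_size)


def sgd_step(weight, grad, lr):
    """Plain gradient-descent update: the learning rate scales the change."""
    return weight - lr * grad


print(steps_per_epoch(dataset_size, batch_size))              # 313
print(steps_per_epoch(dataset_size, batch_size, world_size))  # 157
print(epochs * steps_per_epoch(dataset_size, batch_size, world_size))  # 785
```

Under these assumptions, each process runs about half as many iterations per epoch as a single process would, while the effective batch per update is batch_size times world_size.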


© Chishiki-AI | Cornell University | Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)