DesignSafe ResNet Case Study: Train on Multiple Nodes
ResNet Training
With an understanding of multi-node distributed training in PyTorch, we can adapt our DesignSafe ResNet training script into a version that runs in parallel on multiple nodes.
In single-node multi-GPU training, the local rank is sufficient to identify each GPU. In the multi-node setting, we must use the $GLOBAL_RANK variable instead, and that is the ONLY difference between single- and multi-node training.
In summary, the only change we need to make in the training Python script is where we save/load the model: since we want to save only one copy of the model, we save it on the process with GLOBAL_RANK 0, instead of on the processes with LOCAL_RANK 0 (there is one of those on every node).
Single node: save/load the model when LOCAL_RANK is 0.
Multi node: save/load the model when GLOBAL_RANK is 0.
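The gating logic can be sketched as follows. This is a minimal stand-in, not the actual DesignSafe script: a plain dict saved with pickle replaces the real model's state_dict(), and RANK/LOCAL_RANK are the environment variables torchrun exports for the global and local rank.

```python
import os
import pickle

# torchrun exports these; default to 0 so the sketch also runs standalone
local_rank = int(os.environ.get("LOCAL_RANK", 0))
global_rank = int(os.environ.get("RANK", 0))

state_dict = {"weights": [0.1, 0.2]}  # stand-in for model.state_dict()

# Single node: gating on local_rank == 0 saves exactly one copy.
# Multi node: every node has a process with local_rank == 0, so that gate
# would save one copy per node; gate on the global rank instead.
if global_rank == 0:
    with open("model.pt", "wb") as f:
        pickle.dump(state_dict, f)
```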
We will launch the Python script with the “mpirun + torchrun” launcher we used for the message passing example:
We use mpirun to launch a bash script from one node:
mpirun -np 2 -ppn 1 run_distributed.sh
where the run_distributed.sh script generates the torchrun command on each node for us:
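A hypothetical sketch of such a script is shown below. The exact environment variables depend on the MPI implementation and scheduler; here we assume a Hydra-style launcher that exports PMI_RANK for each process, and train_resnet.py stands in for the actual training script name.

```shell
#!/bin/bash
# run_distributed.sh -- mpirun starts one copy of this script per node
# (-ppn 1), and each copy launches torchrun locally.

# With one process per node, the MPI rank doubles as the node rank.
NODE_RANK=${PMI_RANK:-0}
NNODES=${NNODES:-2}                    # total number of nodes in the job
MASTER_ADDR=${MASTER_ADDR:-localhost}  # hostname of the rank-0 node
MASTER_PORT=${MASTER_PORT:-29500}

torchrun \
    --nnodes="$NNODES" \
    --nproc_per_node=4 \
    --node_rank="$NODE_RANK" \
    --master_addr="$MASTER_ADDR" \
    --master_port="$MASTER_PORT" \
    train_resnet.py
```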
Data Preparation
One last note for multi-node training is about data preparation.
In the code, we download data to $SCRATCH, which all the nodes can access. However, if you would like to download data to /tmp/, the local SSD, some modifications are needed.
If you want to use /tmp/, you would move the data to each node's local SSD before training to obtain superior I/O speed and lower latency. In the single-node example, this is done with a copy command to /tmp:
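For example (the dataset directory name here is illustrative; $SCRATCH is set by the cluster environment):

```shell
# Stage the dataset from the shared filesystem onto the node-local SSD
cp -r "$SCRATCH/dataset" /tmp/
```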
In the multi-node training case, we would need to copy our dataset onto each node's local SSD, meaning we need to launch this copy command on every node in the job. We do this by ssh-ing onto each node and launching the copy command there.
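One way to sketch this, assuming a Slurm job (scontrol expands the job's node list into hostnames) and an illustrative dataset path:

```shell
#!/bin/bash
# Stage the dataset onto every node's local SSD before training.
DATASET="$SCRATCH/dataset"

for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    # Launch the copy on each node in the background so they run in parallel
    ssh "$node" "cp -r $DATASET /tmp/" &
done
wait   # block until every node has finished copying
```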
We are using $SCRATCH for the DesignSafe Example, so these modifications are not included in copy_data.sh.
Now that we have walked through all of the considerations for multi-node training, we can adapt our DesignSafe code to run on multiple nodes.
Launch Training Script in Containerized Environment
Lab: DesignSafe Example
In the notebook, run the cells in "multi-node Application with PyTorch Distributed: Damage Level Example" to run the code shown on this page.
Compare Performance
After launching the training command, you can compare the time it takes to train our model for one epoch against the time for training on a single node. How does our training scale?
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)