To launch our new mnist_torchrun.py script, you can use the torchrun command (torch.distributed.run). Several arguments can be passed to torchrun, and they vary based on the type of job you are launching; for example, a single-node job requires different arguments than a multi-node job. For now, we will cover the arguments needed for a single-node job.

Let's start by introducing three arguments that can be helpful when launching a single node job:

  • --standalone : Indicates to PyTorch that you are running a single-machine, multi-worker job. It automatically sets up a rendezvous backend, represented by a C10d TCP store on port 29400.
  • --nnodes : Total number of nodes being used.
  • --nproc-per-node : Number of processes per node; this is typically set to the number of GPUs on your machine(s).
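Inside the training script itself, each worker process launched by torchrun can discover its identity through environment variables that torchrun sets (RANK, LOCAL_RANK, and WORLD_SIZE, among others). A minimal sketch of reading them follows; the helper name worker_identity is illustrative, not part of any API:

```python
import os

def worker_identity():
    """Read the per-process identity that torchrun exports as environment variables."""
    return {
        "rank": int(os.environ.get("RANK", 0)),            # global rank across all nodes
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # rank within this node; often used to select a GPU
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total number of worker processes
    }

# In a GPU training script, a worker would typically bind to its device with:
#   torch.cuda.set_device(worker_identity()["local_rank"])
```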

To launch a generic training script (YOUR_TRAINING_SCRIPT.py) on a single node with 4 GPUs, you can do the following:
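For example, combining the three arguments introduced above (substitute your own script name and append any arguments your script expects):

```shell
torchrun --standalone --nnodes=1 --nproc-per-node=4 YOUR_TRAINING_SCRIPT.py
```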

Next, let's run our MNIST training script with torchrun:
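Assuming a single node with 4 GPUs (adjust --nproc-per-node to match your hardware), the command looks like:

```shell
torchrun --standalone --nnodes=1 --nproc-per-node=4 mnist_torchrun.py
```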

© Chishiki-AI  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)