PyTorch on Frontera CLX
Building on the general instructions presented previously for managing Python Installations at TACC, we will use the pip package manager to install PyTorch and associated dependencies on Frontera. We will mention specific version numbers, corresponding to what is installed on Frontera and what is available through PyTorch at the time of this writing. To examine what is installed on Frontera, please consult our companion material on the module utility. On this page, we demonstrate running PyTorch on the CPU-only Frontera CLX nodes; see the preceding page if you are interested in running on the Frontera GPU nodes instead.
Building a virtual environment
In order to create and populate a python virtual environment for running PyTorch, you can execute commands from the command-line shell on Frontera, such as those listed in the code block below. We're going to build an environment using the CPU-only version of PyTorch 2.0.0 and name the environment pytorch200cpu, but you can give it a different name or install a different PyTorch version if you prefer. (See the PyTorch "Start Locally" page for additional information about versions.) Since we're running on CPU-only nodes, we don't need to worry about installing a cuda package. On Frontera, enter the following commands to build the environment:
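A minimal sketch of these commands, assuming the python3/3.9.2 module and the CPU-only wheel index from the PyTorch "Start Locally" page (the exact pip3 invocation may vary with the versions you choose):

    # load a recent python3 module (check availability with "module spider python")
    module load python3/3.9.2
    # create and activate the virtual environment
    python3 -m venv pytorch200cpu
    source pytorch200cpu/bin/activate
    # pin urllib3 and install the CPU-only PyTorch packages, as discussed below
    pip3 install urllib3==1.26.14
    pip3 install torch==2.0.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu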
It is worth highlighting a few points about the commands above:
- The module load command will load the python3/3.9.2 module. If you would like to run with a different python version, you can see what is available via module spider python and then modify the load command to add a specific version number; make sure it is compatible with the PyTorch version you'd like to run.
- The call to source pytorch200cpu/bin/activate activates the environment that you have just created. You should notice that your command prompt has changed to reflect the environment you are now running in. You can run deactivate from the command line to exit this environment, say, if you wanted to activate a new environment.
- The pip3 command to install torch and associated packages specifies that you want the CPU-only version of the code. You don't need to restrict yourself to the CPU-only version, since the general version of the package can auto-detect whether GPUs are available and will still run even if they are not. With the general version, however, you might see additional warning messages about not being able to find a GPU, which can be distracting. Therefore, if you know that you will be running on CPU-only nodes such as Frontera CLX, you can simplify things a bit by installing just the CPU-only package.
- The specification of urllib3==1.26.14 might be necessary since dependency requirements for more recent versions of that package are in conflict with the available python3 modules on Frontera. (The urllib3 package is a dependency of the other packages involved in this install.) It is possible that when more recent python3 versions become available on Frontera, this specific urllib3 version might not need to be specified.
- While this is a minimal environment designed to run the code example below, you might want to install more packages to provide additional functionality for data processing, visualization, etc. (e.g., pandas, matplotlib, scikit-learn).
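For instance, with the environment activated, extra packages such as these (an illustrative selection, not required for the example) could be added with:

    pip3 install pandas matplotlib scikit-learn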
Running the code example
You should have already downloaded the sample code from our PyTorch Example page. If you have not, please do so now. In what follows below, we'll assume that we have saved that code to a file named quickstart_tutorial.py.
Depending on whether you want to run the example interactively or in batch, you will need to select the correct slurm and/or idev commands. See our companion material on Slurm for further information. If you want to run the code interactively, you will want to use the TACC shell command idev to request a node on one of the available partitions. For a small, short program like this, you can request a node on the development queue:
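One possible idev invocation (the node count and time limit here are illustrative choices; adjust them to your needs):

    idev -p development -N 1 -m 60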
Once you've been allocated a compute node, make sure that you've activated the virtual environment that you created above, and navigate to the directory where the code example is located. Now you can run the program as follows:
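    python3 quickstart_tutorial.py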
You should observe output such as that reproduced at the bottom of this page. (You might also see some other informational messages.)
Alternatively, if you'd like to submit a batch job to run this code, you could use a slurm submission file containing the following commands (where you replace /path/to/virtualenvironment with the path to wherever your python virtual environment is located):
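A sketch of such a submission file is shown below; the job name, queue, node and task counts, and time limit are illustrative choices that you should adjust for your own run:

    #!/bin/bash
    #SBATCH -J quickstart_tutorial       # job name (illustrative)
    #SBATCH -o quickstart_tutorial.o%j   # file for slurm's own messages
    #SBATCH -p development               # queue (partition) to run in
    #SBATCH -N 1                         # number of nodes
    #SBATCH -n 1                         # number of tasks
    #SBATCH -t 01:00:00                  # wall-clock time limit

    module load python3/3.9.2
    source /path/to/virtualenvironment/bin/activate
    python3 quickstart_tutorial.py > log_quickstart_tutorial_frontera_clx.txt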
If you copy and paste the above commands into a file called submit_quickstart_tutorial_frontera_clx.sh, you can submit a batch job via slurm with the following command:
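    sbatch submit_quickstart_tutorial_frontera_clx.sh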
Once this has finished running, you can examine the output that was redirected to the file log_quickstart_tutorial_frontera_clx.txt. You should observe output similar to what is reproduced at the bottom of this page.
If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.
Output from quickstart_tutorial.py
After setting up the model and printing out a model summary, the program spends most of its time training parameters based on the data through a series of 5 epochs, eventually reaching a test accuracy of approximately 64%. Unlike the TensorFlow/Keras example we considered, this code does not automatically report the time required for each epoch to complete. (Hint: if you wanted to report the time, you could instrument some calls to the python time.time() function around the train+test steps for each of the epochs, and print out the time required for each pass, as sketched below.)
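A minimal sketch of that instrumentation, assuming the train and test functions, data loaders, model, loss function, and optimizer defined in quickstart_tutorial.py:

    import time

    epochs = 5
    for t in range(epochs):
        print(f"Epoch {t+1}\n-------------------------------")
        start = time.time()                    # wall-clock time before this epoch
        train(train_dataloader, model, loss_fn, optimizer)
        test(test_dataloader, model, loss_fn)
        elapsed = time.time() - start          # seconds spent on this train+test pass
        print(f"Epoch {t+1} took {elapsed:.1f} seconds")
    print("Done!")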