Building on the general instructions presented previously for managing Python installations at TACC, we will use the pip package manager to install PyTorch and its associated dependencies on Frontera. We will cite specific version numbers corresponding to what was installed on Frontera, and available through PyTorch, at the time of this writing. To examine what is installed on Frontera, please consult our companion material on the module utility. Since Frontera contains nodes with attached GPUs, we will illustrate the installation of the CPU+GPU-enabled version of PyTorch; if you want to run on the Frontera CPU nodes without attached GPUs, the same installation will work there as well.

Building a virtual environment

To create and populate a Python virtual environment for running PyTorch, you can execute commands from the command-line shell on Frontera such as those listed in the code block below. We're going to build an environment using PyTorch version 2.0.0 and will name the environment pytorch200, but you can give it a different name or install a different PyTorch version if you prefer. (See the PyTorch "Start Locally" page for additional information about versions.) We're going to use PyTorch 2.0.0 in conjunction with CUDA 11.3:
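A sketch of such a command sequence is shown below. The pip index URL for your chosen PyTorch/CUDA combination should be taken from the PyTorch "Start Locally" page, so treat the install line here as illustrative rather than definitive:

```shell
# Load a sufficiently recent python3 module (PyTorch 2.0.0 requires Python >= 3.8)
module load python3/3.9.2

# Create a virtual environment named pytorch200 and activate it
python3 -m venv pytorch200
source pytorch200/bin/activate

# Pin urllib3 to a version compatible with the available python3 modules on Frontera
pip install urllib3==1.26.14

# Install PyTorch 2.0.0 (plus torchvision, used by the code example);
# the index URL is illustrative -- consult the "Start Locally" page for your setup
pip install torch==2.0.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu113
```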


It is worth highlighting a few points about the commands above:

  • The module load python3/3.9.2 command will load Python version 3.9.2. PyTorch 2.0.0 requires Python >= 3.8. Note that as of the time of this writing, the default python3 module on Frontera is 3.7.0.
  • The call to source pytorch200/bin/activate activates the environment that you have just created. You should notice that your command prompt changes to reflect the environment you are now running in. You can run deactivate from the command line to exit this environment, for instance if you want to activate a different environment.
  • The specification of urllib3==1.26.14 might be necessary since dependency requirements for more recent versions of that package are in conflict with the available python3 modules on Frontera. (The urllib3 package is a dependency of the other packages involved in this install.) It is possible that when more recent python3 versions become available on Frontera, this specific urllib3 version might not need to be specified.
  • While this is a minimal environment designed to run the code example below, you might want to install more packages to provide additional functionality for data processing, visualization, etc. (e.g., pandas, matplotlib, scikit-learn).
Running the code example

You should have already downloaded the sample code from our PyTorch Example page. If you have not, please do so now. In what follows below, we'll assume that we have saved that code to a file named quickstart_tutorial.py.

Depending on whether you want to run the example interactively or in batch, you will need to select the appropriate Slurm and/or idev commands. See our companion material on Slurm for further information. If you want to run the code interactively, use the TACC shell command idev to request a node on one of the available partitions. If you want to run the example code on the Frontera CPU nodes, you can request the development queue. To access the GPU nodes for a small, short program like this, you can request a node on the rtx-dev queue:
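As a sketch (the node count and time limit are illustrative choices you should adjust to your needs), an idev request for a single rtx-dev node might look like:

```shell
# Request one node on the rtx-dev queue for 60 minutes
idev -p rtx-dev -N 1 -m 60
```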


Once you've been allocated a compute node, make sure that you've activated the pytorch200 environment that you created above, and navigate to the directory where the code example is located. Now you can run the program as follows:
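Assuming the environment is active and the example was saved under the filename given above, the invocation is simply:

```shell
python3 quickstart_tutorial.py
```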


You should observe output such as that reproduced at the bottom of this page. (You might also see some other informational messages.)

Alternatively, if you'd like to submit a batch job to run this code, you could use a Slurm submission file containing the following commands (replacing /path/to/virtualenvironment with the actual location of your Python virtual environment):
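A minimal sketch of such a submission file is given below; the job name, output filename, and time limit are illustrative choices, and the queue shown matches the rtx-dev GPU queue discussed above:

```shell
#!/bin/bash
#SBATCH -J quickstart_tutorial      # job name (illustrative)
#SBATCH -o quickstart_tutorial.o%j  # output file (%j expands to the job ID)
#SBATCH -p rtx-dev                  # queue (partition)
#SBATCH -N 1                        # number of nodes
#SBATCH -n 1                        # number of tasks
#SBATCH -t 00:30:00                 # maximum run time (hh:mm:ss)

module load python3/3.9.2
source /path/to/virtualenvironment/bin/activate
python3 quickstart_tutorial.py
```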


If you copy and paste the above commands into a file called submit_quickstart_tutorial_frontera_gpu.sh, you can submit a batch job via Slurm with the following command:
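That is:

```shell
sbatch submit_quickstart_tutorial_frontera_gpu.sh
```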


If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.

Output from quickstart_tutorial.py

After setting up the model and printing out a model summary, the program spends most of its time training parameters based on the data through a series of 5 epochs, eventually reaching a test accuracy of approximately 64%. Unlike the TensorFlow/Keras example we considered, this code does not automatically report the time required for each epoch to complete. (Hint: if you wanted to report the time, you could instrument some calls to the Python time.time() function around the train+test steps for each of the epochs, and print out the time required for each pass.)


©  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Inclusivity Statement