TensorFlow on Frontera GPU
Building on the general instructions presented previously for managing Python environments at TACC, we will use the pip package manager to install TensorFlow and associated dependencies on Frontera. We will mention specific version numbers, corresponding to what is currently installed on Frontera at the time of this writing. Some of those installed versions might change over time, in which case you might need to inspect the list of available modules. See our companion material on the module utility for more information.
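For example, you can list the relevant modules with commands along these lines (the module names are the ones referenced on this page; the versions reported will depend on what is currently installed):

```bash
module avail python3    # list the python3 modules installed on Frontera
module avail cuda       # list the available CUDA toolkit versions
```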
Building a virtual environment
Frontera is a large, leadership-class cluster composed of a massive number of Intel Cascade Lake (CLX) CPU nodes, and a much smaller number of GPU nodes, each with 4 NVIDIA Quadro RTX 5000 GPUs. If you are interested in running specifically on the CLX CPU nodes, please consult the instructions provided for running either TensorFlow or PyTorch on the Frontera CLX subsystem. (For further information about the Intel Xeon CPUs on Frontera, please consult our companion material on Advanced Cluster Architectures.) On this page, we will focus on using the GPU nodes on Frontera.
Given the versions of python3 and cuda currently installed on Frontera, we will create and populate a python virtual environment to run TensorFlow version 2.8.2, named tf282. (You could choose a different name for the environment.) The commands below will install tensorflow, which can run on both cpu-only and cpu+gpu machines. We're going to provide information on how to run this on the GPU nodes on Frontera (accessible through the rtx or rtx-dev queues), but as noted, you could also run on Frontera in cpu-only mode using the main Cascade Lake (CLX) nodes. It is worth noting that the GPU-attached nodes on Frontera have CPUs ("Broadwell") that are slightly less capable than the CLX processors, since the assumption is that you will be off-loading most of the computation on those nodes to the GPUs.
After having logged in to Frontera and navigated to where you'd like to build your virtual environment, you can execute the commands below in the shell. Note that a python virtual environment cannot be easily moved once it is created, since the file path where you create it gets hardwired into scripts that are generated. If you decide you'd like to have the environment located somewhere else or have a different name, it is easier just to remove the old environment and build a new one. Note also that the following commands can be run on the Frontera login nodes to configure the environment, although you will need to request one or more compute nodes before running the code example, as described below.
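A minimal sketch of those commands follows. The bare module load lines assume that the default python3 and cuda modules are suitable (depending on the installed stack, a cudnn module may also be needed); check module avail if the defaults have changed:

```bash
module load python3              # system Python 3 used to build the environment
module load cuda                 # CUDA toolkit, needed for GPU support at runtime

python3 -m venv tf282            # create the virtual environment (any name will do)
source tf282/bin/activate        # activate it

pip install --upgrade pip        # TensorFlow 2 packages require pip > 19.0
pip install tensorflow==2.8.2    # cpu+gpu build of TensorFlow
```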
Let's highlight a few points about the commands above:
- The pip upgrade is necessary because TensorFlow 2 packages require a pip version > 19.0, as per the tensorflow.org install page. It's generally a good idea to upgrade pip whenever you create a new environment.
- While this is a small environment designed to run the code example below, you might want to install more packages to provide additional functionality for data processing, visualization, etc. (e.g., pandas, matplotlib, scikit-learn).
Running the code example
You should have already downloaded the sample code mnist_convnet.py from our TensorFlow/Keras Example page. If you have not, please do so now.
Depending on whether you want to run the example interactively or in batch, you will need to select the correct slurm and/or idev commands. See our companion material on Slurm for further information. If you want to run the code interactively, you will want to use the TACC shell command idev to request a node on one of the available partitions. For a short-running program like this, you can request a node on the rtx-dev queue, which provides access to the GPU-attached nodes.
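One reasonable invocation is sketched below; the node count and time limit are illustrative, and you may also need to specify your allocation with the -A option:

```bash
idev -p rtx-dev -N 1 -t 00:30:00    # one GPU-attached node for 30 minutes
```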
Once you've been allocated a compute node, make sure that you've activated the tf282 environment that you created above, and navigate to the directory where the code example is located. Now you can run the program as follows:
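```bash
python3 mnist_convnet.py    # run with the environment's python3 interpreter
```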
You should observe output such as that reproduced at the bottom of this page. (You might also see some other informational messages.)
Alternatively, if you'd like to submit a batch job to run this code, you could use a slurm submission script containing the following commands (where you replace /path/to/virtualenvironment with the location of your python virtual environment).
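A sketch of such a script is given below; the job name, node/task counts, and time limit are illustrative, and you may also need an #SBATCH -A line specifying your allocation:

```bash
#!/bin/bash
#SBATCH -J mnist_convnet      # job name (illustrative)
#SBATCH -p rtx                # GPU partition (rtx-dev also works for a short run like this)
#SBATCH -N 1                  # one node
#SBATCH -n 1                  # one task
#SBATCH -t 00:30:00           # walltime (adjust as needed)

module load python3 cuda      # same modules used when building the environment

VENVDIR=/path/to/virtualenvironment    # edit this path
source $VENVDIR/bin/activate

python3 mnist_convnet.py > log_mnist_convnet_frontera_gpu.txt
```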
If you copy and paste the above commands into a file called submit_mnist_convnet_frontera_gpu.sh and edit the path to VENVDIR, you can submit a batch job via slurm with the following command:
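```bash
sbatch submit_mnist_convnet_frontera_gpu.sh
```

You can monitor the job with squeue -u $USER while it is waiting in the queue or running.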
Once this has finished running, you can examine the output that was redirected to the file log_mnist_convnet_frontera_gpu.txt. You should observe output similar to what is reproduced at the bottom of this page. But you might also see lots of extraneous characters in the output file, rendered as ^H. These arise from the fact that the default level of verbosity for the model.fit method is verbose='auto', which defaults to displaying a progress bar. While that option is useful for providing visual feedback about the program's progress when run interactively, it causes the insertion of many additional characters when the output is redirected to a file. You might want to experiment with some different values of the verbose parameter to find a better option for saving output in a text file.
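For instance, verbose=2 prints a single summary line per epoch with no progress bar, which keeps redirected logs clean. A minimal sketch of the modified call is shown below; the other arguments mirror the example script and may differ slightly from your copy:

```python
# In mnist_convnet.py: verbose=2 emits one line per epoch instead of an
# animated progress bar, avoiding the ^H control characters in redirected logs.
model.fit(x_train, y_train,
          batch_size=128, epochs=15,
          validation_split=0.1,
          verbose=2)
```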
If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.
Output from mnist_convnet.py
After setting up the model and printing out a model summary, the program spends most of its time training parameters based on the data through a series of 15 epochs, eventually reaching a test accuracy and validation accuracy close to 1.0. Note that on the Frontera gpu-attached nodes (rtx-dev), each epoch takes roughly 1s (3-4 ms/step), which is approximately 2 times faster than we observed on the Frontera CLX nodes. For larger datasets, you are likely to see a much larger speedup in moving from CPU to GPU, but since this is not a very large problem, we are probably not feeding the GPU with enough data to fully leverage its capabilities.
Note that TensorFlow created devices for all 4 of the GPUs attached to our compute node, but it only actually used 1 of those GPUs. You can verify that by opening a second terminal, ssh'ing into the compute node that you've been allocated, and then running nvidia-smi while the mnist_convnet program is running. You should see that only 1 GPU (id 0) is being accessed for this computation, even though you are being charged for access to all of them.
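For example, from a second terminal on a Frontera login node (c123-456 is a placeholder; substitute the hostname of the compute node you were allocated):

```bash
ssh c123-456     # placeholder hostname for your allocated compute node
nvidia-smi       # reports utilization and memory use for each of the 4 GPUs
```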
GPU vs CPU: further investigation
If you're interested in exploring further, you could carry out some of the following sorts of tests and exercises.
- If you've forgotten to load the cuda module, or if you were to execute the command module unload cuda, the code would still run, but in cpu-only mode. Because you are running the general cpu+gpu version of tensorflow, you will see lots of warning messages about not being able to locate cuda libraries and initialize GPUs.
- First try running a cpu-only version of TensorFlow by unloading the cuda module on the same rtx-dev node that you allocated previously (see the sketch after this list). Note the slowdown as compared to the GPU-enabled run shown above.
- Next, exit out of your node in the rtx-dev queue, and grab a new node in the CPU-only development queue. Now, run the mnist_convnet code again, and note the time per step during the training phase.
- You will probably observe that the GPU-enabled version on the rtx-dev queue is fastest, followed by the CPU-only version on the development queue, trailed by the missing-cuda version on the rtx-dev queue.
- For this small problem size, the GPU-enabled version is only slightly faster than the CPU-only version on the development queue. The missing-cuda version on the rtx-dev queue is the slowest, since the Broadwell-based CPUs on the rtx-dev queue are less capable than the CLX nodes in the development queue.
- For larger problems and datasets, we would expect to see a substantially greater training speedup in moving from CPUs to GPUs.
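A sketch of the first two exercises follows; the time limit and node count are illustrative, and /path/to/tf282 is a placeholder for wherever you built your environment:

```bash
# 1) cpu-only run on the rtx-dev node you already have:
module unload cuda                     # TensorFlow falls back to cpu-only mode
python3 mnist_convnet.py

# 2) cpu-only run on a CLX node in the development queue:
exit                                   # relinquish the rtx-dev node
idev -p development -N 1 -t 00:30:00   # request a CPU-only node
source /path/to/tf282/bin/activate     # placeholder path: reactivate your environment
python3 mnist_convnet.py
```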