TensorFlow on Frontera CLX
Building on the general instructions presented previously for managing Python environments at TACC, we will use the pip package manager to install TensorFlow and associated dependencies for use on the CPU-only CLX nodes on Frontera. We will mention specific version numbers, corresponding to what is installed on Frontera at the time of this writing. Some of those installed versions might change over time, in which case you might need to inspect the list of available modules; see our companion material on the module utility for more information.
Building a virtual environment
Given the versions of python3 currently installed on Frontera, we will create and populate a python virtual environment, named tf282cpu, to run TensorFlow version 2.8.2. (You could choose a different name for the environment.) The commands below will install tensorflow-cpu, the cpu-only version of the package. (See below for a discussion of running the cpu+gpu version on the CPU-only CLX nodes.) We're going to provide information on how to run this on the development queue on Frontera.
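A minimal sequence of commands along these lines will do the job (placing the environment in your home directory is just an example; adjust the location and names to your preferences):

    module load python3                 # loads the default python3 module
    python3 -m venv $HOME/tf282cpu     # create the virtual environment
    source $HOME/tf282cpu/bin/activate # activate the environment
    pip install --upgrade pip          # TensorFlow 2 requires pip > 19.0
    pip install tensorflow-cpu==2.8.2  # cpu-only version of TensorFlow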
It is worth highlighting a few points about the commands above:
- The module load command will load the default python3 module, which is version 3.7.0 at the time of this writing. If you would like to run with a different python version, you can see what is available via module avail or module spider, and then modify the load command to include a specific version number.
- The pip upgrade is necessary because TensorFlow 2 packages require a pip version > 19.0, as per the tensorflow.org install page. It is generally a good idea to upgrade pip whenever you create a new environment.
- The pip command to install tensorflow-cpu specifies that you want the cpu-only version of the code, which is sufficient if you are only planning to run on the CPU-only Frontera CLX nodes. You don't need to restrict yourself to the cpu-only version: the general version of the package (tensorflow==2.8.2) auto-detects whether GPUs are available and can still run even if none are found. With the general version, however, you will see additional warning messages about not being able to find a GPU, which can be distracting. Therefore, if you know that you will be running only on the CLX nodes, you can simplify things a bit by installing tensorflow-cpu.
- While this is a small environment designed to run the code example below, you might want to install more packages to provide additional functionality for data processing, visualization, etc. (e.g., pandas, matplotlib).
Running the code example
You should have already downloaded the sample code mnist_convnet.py from our TensorFlow/Keras Example page. If you have not, please do so now.
Depending on whether you want to run the example interactively or in batch, you will need to select the correct slurm and/or idev commands. See our companion material on Slurm for further information.
If you want to run the code interactively, you will want to use the TACC shell command idev to request a node on one of the available partitions. For a short-running program like this, you can request a node on the development queue.
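For example, a request along these lines should suffice (the node count and time limit here are just illustrative choices):

    idev -p development -N 1 -t 00:30:00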
Once you've been allocated a compute node, make sure that you've activated the tf282cpu environment that you created above, and navigate to the directory where the code example is situated. Now you can run the program.
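For instance (the environment and code locations here are placeholders; substitute your own paths):

    source $HOME/tf282cpu/bin/activate  # activate the environment created above
    cd $WORK/mnist_example              # directory containing mnist_convnet.py
    python3 mnist_convnet.py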
You should observe output such as that reproduced at the bottom of this page. (You might also see some other informational messages.)
Alternatively, if you'd like to submit a batch job to run this code, you could use a slurm submission script containing the following commands (where you replace /path/to/virtualenvironment with the actual location of your python virtual environment).
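A sketch of such a script, with illustrative choices for the job name, queue, and time limit, might look like this:

    #!/bin/bash
    #SBATCH -J mnist_convnet            # job name
    #SBATCH -p development              # queue (partition)
    #SBATCH -N 1                        # number of nodes
    #SBATCH -n 1                        # number of tasks
    #SBATCH -t 00:30:00                 # wall time (hh:mm:ss)

    module load python3

    # Edit this path to point to your virtual environment
    VENVDIR=/path/to/virtualenvironment
    source $VENVDIR/bin/activate

    python3 mnist_convnet.py > log_mnist_convnet_frontera_clx.txt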
If you copy and paste the above commands into a file called submit_mnist_convnet_frontera_clx.sh and edit the path to VENVDIR, you can submit a batch job via slurm with the following command:
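    sbatch submit_mnist_convnet_frontera_clx.sh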
Once this has finished running, you can examine the output that was redirected to the file log_mnist_convnet_frontera_clx.txt. You should observe output similar to what is reproduced at the bottom of this page, but you might also see lots of extraneous characters in the output file, rendered as ^H. These arise from the fact that the default level of verbosity for the model.fit method is verbose='auto', which defaults to displaying a progress bar. While that option is useful for providing visual feedback about the program's progress when run interactively, it causes the insertion of many additional control characters when the output is redirected to a file. You might want to experiment with different values of the verbose parameter (e.g., verbose=2, which prints one line per epoch) to find a better option for saving output to a text file.
If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.
Output from mnist_convnet.py
After setting up the model and printing out a model summary, the program spends most of its time training parameters based on the data through a series of 15 epochs, eventually reaching a test accuracy and a validation accuracy close to 1.0. On the Frontera CLX nodes, each epoch takes roughly 4s (8-9 ms/step).
CPU+GPU code on CPU-only node
As noted above, we have instructed you to configure an environment with the cpu-only tensorflow-cpu package installed if you are planning to run on the Frontera CLX nodes, but you could also use the gpu-enabled tensorflow package from the environment you created as part of TensorFlow on Frontera GPU. If you activate the tf282 environment that you previously created and then run mnist_convnet.py, you will see that it runs to completion, but that it issues warning messages about not detecting any GPUs.
If you anticipate running on both the CLX nodes and the RTX/GPU nodes, you might find it preferable to just build one environment that can run on either subsystem, and suffer through the additional warning messages, rather than building and maintaining two separate environments.