TensorFlow on Frontera CLX
Building on the general instructions presented previously for managing Python environments at TACC, we will use the pip package manager to install TensorFlow and associated dependencies for use on the CPU-only CLX nodes on Frontera. We will mention specific version numbers, corresponding to what is installed on Frontera at the time of this writing. Some of those installed versions might change over time, in which case you might need to inspect the list of available modules; see our companion material on the module utility for more information.
Building a virtual environment
Given the versions of python3 currently installed on Frontera, we will create and populate a python virtual environment, named tf282cpu, to run TensorFlow version 2.8.2. (You could choose a different name for the environment.) The commands below will install tensorflow-cpu, the cpu-only version of the package. (See below for a discussion of running the cpu+gpu version on the CPU-only CLX nodes.) We're going to provide information on how to run this on the development queue on Frontera.
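A minimal sequence of commands along these lines will do the job (placing the environment in your home directory is just an example; adjust the location and names to your preferences):

    module load python3                 # loads the default python3 module
    python3 -m venv $HOME/tf282cpu     # create the virtual environment
    source $HOME/tf282cpu/bin/activate # activate the environment
    pip install --upgrade pip          # TensorFlow 2 requires pip > 19.0
    pip install tensorflow-cpu==2.8.2  # cpu-only version of TensorFlow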
It is worth highlighting a few points about the commands above:
- The module load command will load the default python3 module, which is version 3.7.0 at the time of this writing. If you would like to run with a different python version, you can see what is available via module avail or module spider, and then modify the load command to include a specific version number.
- The pip upgrade is necessary because TensorFlow 2 packages require a pip version > 19.0, as per the tensorflow.org install page. It is generally a good idea to upgrade pip whenever you create a new environment.
- The pip command to install tensorflow-cpu specifies that you want the cpu-only version of the code, which is sufficient if you are only planning to run on the CPU-only Frontera CLX nodes. You don't need to restrict yourself to the cpu-only version: the general version of the package (tensorflow==2.8.2) auto-detects whether GPUs are available and can still run even if none are found. With the general version, however, you will see additional warning messages about not being able to find a GPU, which can be distracting. Therefore, if you know that you will be running only on the CLX nodes, you can simplify things a bit by installing tensorflow-cpu.
- While this is a small environment designed to run the code example below, you might want to install more packages to provide additional functionality for data processing, visualization, etc. (e.g., pandas, matplotlib).
Running the code example
You should have already downloaded the sample code mnist_convnet.py from our TensorFlow/Keras Example page. If you have not, please do so now.
Depending on whether you want to run the example interactively or in batch, you will need to select the correct slurm and/or idev commands. See our companion material on Slurm for further information.
If you want to run the code interactively, you will want to use the TACC shell command idev to request a node on one of the available partitions. For a short-running program like this, you can request a node on the development queue.
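For example, a request along these lines should suffice (the node count and time limit here are just illustrative choices):

    idev -p development -N 1 -t 00:30:00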
Once you've been allocated a compute node, make sure that you've activated the tf282cpu environment that you created above, and navigate to the directory where the code example is situated. Now you can run the program.
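For instance (the environment and code locations here are placeholders; substitute your own paths):

    source $HOME/tf282cpu/bin/activate  # activate the environment created above
    cd $WORK/mnist_example              # directory containing mnist_convnet.py
    python3 mnist_convnet.py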
You should observe output such as that reproduced at the bottom of this page. (You might also see some other informational messages.)
Alternatively, if you'd like to submit a batch job to run this code, you could use a slurm submission script containing the following commands (where you replace /path/to/virtualenvironment with the actual location of your python virtual environment).
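A sketch of such a script, with illustrative choices for the job name, queue, and time limit, might look like this:

    #!/bin/bash
    #SBATCH -J mnist_convnet            # job name
    #SBATCH -p development              # queue (partition)
    #SBATCH -N 1                        # number of nodes
    #SBATCH -n 1                        # number of tasks
    #SBATCH -t 00:30:00                 # wall time (hh:mm:ss)

    module load python3

    # Edit this path to point to your virtual environment
    VENVDIR=/path/to/virtualenvironment
    source $VENVDIR/bin/activate

    python3 mnist_convnet.py > log_mnist_convnet_frontera_clx.txt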
If you copy and paste the above commands into a file called submit_mnist_convnet_frontera_clx.sh and edit the path to VENVDIR, you can submit a batch job via slurm with the following command:
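    sbatch submit_mnist_convnet_frontera_clx.sh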
Once this has finished running, you can examine the output that was redirected to the file log_mnist_convnet_frontera_clx.txt. You should observe output similar to what is reproduced at the bottom of this page, but you might also see lots of extraneous characters in the output file, rendered as ^H. These arise from the fact that the default level of verbosity for the model.fit method is verbose='auto', which defaults to displaying a progress bar. While that option is useful for providing visual feedback about the program's progress when run interactively, it causes the insertion of many additional control characters when the output is redirected to a file. You might want to experiment with different values of the verbose parameter (e.g., verbose=2, which prints one line per epoch) to find a better option for saving output to a text file.
If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.
Output from mnist_convnet.py
After setting up the model and printing out a model summary, the program spends most of its time training parameters based on the data through a series of 15 epochs, eventually reaching a test accuracy and a validation accuracy close to 1.0. On the Frontera CLX nodes, each epoch takes roughly 4s (8-9 ms/step).
CPU+GPU code on CPU-only node
As noted above, we have instructed you to configure an environment with the cpu-only tensorflow-cpu package installed if you are planning to run on the Frontera CLX nodes, but you could also use the gpu-enabled tensorflow package from the environment you created as part of TensorFlow on Frontera GPU. If you activate the tf282 environment that you previously created and then run mnist_convnet.py, you will see that it runs to completion, but that it issues warning messages about not detecting any GPUs.
If you anticipate running on both the CLX nodes and the RTX/GPU nodes, you might find it preferable to just build one environment that can run on either subsystem, and suffer through the additional warning messages, rather than building and maintaining two separate environments.