TensorFlow on Frontera GPU
Building on the general instructions presented previously for managing Python environments at TACC, we will use the pip package manager to install TensorFlow and associated dependencies on Frontera. We will mention specific version numbers, corresponding to what is currently installed on Frontera at the time of this writing. Some of those installed versions might change over time, in which case you might need to inspect the list of available modules. See our companion material on the module utility for more information.
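For example, you can list the relevant modules with commands along these lines (the module names are the ones referenced on this page; the versions reported will depend on what is currently installed):

```bash
module avail python3    # list the python3 modules installed on Frontera
module avail cuda       # list the available CUDA toolkit versions
```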
Building a virtual environment
Frontera is a large, leadership-class cluster composed of a massive number of Intel Cascade Lake (CLX) CPU nodes, and a much smaller number of GPU nodes, each with 4 NVIDIA Quadro RTX 5000 GPUs. If you are interested in running specifically on the CLX CPU nodes, please consult the instructions provided for running either TensorFlow or PyTorch on the Frontera CLX subsystem. (For further information about the Intel Xeon CPUs on Frontera, please consult our companion material on Advanced Cluster Architectures.) On this page, we will focus on using the GPU nodes on Frontera.
Given the versions of python3 and cuda currently installed on Frontera, we will create and populate a python virtual environment to run TensorFlow version 2.8.2, named tf282. (You could choose a different name for the environment.) The commands below will install tensorflow, which can run on both cpu-only and cpu+gpu machines. We're going to provide information on how to run this on the GPU nodes on Frontera (accessible through the rtx or rtx-dev queues), but as noted, you could also run on Frontera in cpu-only mode using the main Cascade Lake (CLX) nodes. It is worth noting that the GPU-attached nodes on Frontera have CPUs ("Broadwell") that are slightly less capable than the CLX processors, since the assumption is that you will be off-loading most of the computation on those nodes to the GPUs.
After having logged in to Frontera and navigated to where you'd like to build your virtual environment, you can execute the commands below in the shell. Note that a python virtual environment cannot be easily moved once it is created, since the file path where you create it gets hardwired into scripts that are generated. If you decide you'd like to have the environment located somewhere else or have a different name, it is easier just to remove the old environment and build a new one. Note also that the following commands can be run on the Frontera login nodes to configure the environment, although you will need to request one or more compute nodes before running the code example, as described below.
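A minimal sketch of those commands follows. The bare module load lines assume that the default python3 and cuda modules are suitable (depending on the installed stack, a cudnn module may also be needed); check module avail if the defaults have changed:

```bash
module load python3              # system Python 3 used to build the environment
module load cuda                 # CUDA toolkit, needed for GPU support at runtime

python3 -m venv tf282            # create the virtual environment (any name will do)
source tf282/bin/activate        # activate it

pip install --upgrade pip        # TensorFlow 2 packages require pip > 19.0
pip install tensorflow==2.8.2    # cpu+gpu build of TensorFlow
```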
Let's highlight a few points about the commands above:
- The pip upgrade is necessary because TensorFlow 2 packages require a pip version > 19.0, as per the tensorflow.org install page. It's generally a good idea to upgrade pip whenever you create a new environment.
- While this is a small environment designed to run the code example below, you might want to install more packages to provide additional functionality for data processing, visualization, etc. (e.g., pandas, matplotlib, scikit-learn).
Running the code example
You should have already downloaded the sample code mnist_convnet.py from our TensorFlow/Keras Example page. If you have not, please do so now.
Depending on whether you want to run the example interactively or in batch, you will need to select the correct slurm and/or idev commands. See our companion material on Slurm for further information. If you want to run the code interactively, you will want to use the TACC shell command idev to request a node on one of the available partitions. For a short-running program like this, you can request a node on the rtx-dev queue, which provides access to the GPU-attached nodes.
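One reasonable invocation is sketched below; the node count and time limit are illustrative, and you may also need to specify your allocation with the -A option:

```bash
idev -p rtx-dev -N 1 -t 00:30:00    # one GPU-attached node for 30 minutes
```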
Once you've been allocated a compute node, make sure that you've activated the tf282 environment that you created above, and navigate to the directory where the code example is located. Now you can run the program as follows:
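```bash
python3 mnist_convnet.py    # run with the environment's python3 interpreter
```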
You should observe output such as that reproduced at the bottom of this page. (You might also see some other informational messages.)
Alternatively, if you'd like to submit a batch job to run this code, you could use a slurm submission script containing the following commands (where you replace /path/to/virtualenvironment with the location of your python virtual environment).
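A sketch of such a script is given below; the job name, node/task counts, and time limit are illustrative, and you may also need an #SBATCH -A line specifying your allocation:

```bash
#!/bin/bash
#SBATCH -J mnist_convnet      # job name (illustrative)
#SBATCH -p rtx                # GPU partition (rtx-dev also works for a short run like this)
#SBATCH -N 1                  # one node
#SBATCH -n 1                  # one task
#SBATCH -t 00:30:00           # walltime (adjust as needed)

module load python3 cuda      # same modules used when building the environment

VENVDIR=/path/to/virtualenvironment    # edit this path
source $VENVDIR/bin/activate

python3 mnist_convnet.py > log_mnist_convnet_frontera_gpu.txt
```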
If you copy and paste the above commands into a file called submit_mnist_convnet_frontera_gpu.sh and edit the path to VENVDIR, you can submit a batch job via slurm with the following command:
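```bash
sbatch submit_mnist_convnet_frontera_gpu.sh
```

You can monitor the job with squeue -u $USER while it is waiting in the queue or running.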
Once this has finished running, you can examine the output that was redirected to the file log_mnist_convnet_frontera_gpu.txt. You should observe output similar to what is reproduced at the bottom of this page. But you might also see lots of extraneous characters in the output file, rendered as ^H. These arise from the fact that the default level of verbosity for the model.fit method is verbose='auto', which defaults to displaying a progress bar. While that option is useful for providing visual feedback about the program's progress when run interactively, it causes the insertion of many additional characters when the output is redirected to a file. You might want to experiment with some different values of the verbose parameter to find a better option for saving output in a text file.
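For instance, verbose=2 prints a single summary line per epoch with no progress bar, which keeps redirected logs clean. A minimal sketch of the modified call is shown below; the other arguments mirror the example script and may differ slightly from your copy:

```python
# In mnist_convnet.py: verbose=2 emits one line per epoch instead of an
# animated progress bar, avoiding the ^H control characters in redirected logs.
model.fit(x_train, y_train,
          batch_size=128, epochs=15,
          validation_split=0.1,
          verbose=2)
```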
If for some reason you encounter errors when building the environment or running the code, consult the troubleshooting page to see if those errors are addressed.
Output from mnist_convnet.py
After setting up the model and printing out a model summary, the program spends most of its time training parameters based on the data through a series of 15 epochs, eventually reaching a test accuracy and validation accuracy close to 1.0. Note that on the Frontera gpu-attached nodes (rtx-dev), each epoch takes roughly 1s (3-4 ms/step), which is approximately 2 times faster than we observed on the Frontera CLX nodes. For larger datasets, you are likely to see a much larger speedup in moving from CPU to GPU, but since this is not a very large problem, we are probably not feeding the GPU with enough data to fully leverage its capabilities.
Note that TensorFlow created devices for all 4 of the GPUs attached to our compute node, but it only actually used 1 of those GPUs. You can verify that by opening a second terminal, ssh'ing into the compute node that you've been allocated, and then running nvidia-smi while the mnist_convnet program is running. You should see that only 1 GPU (id 0) is being accessed for this computation, even though you are being charged for access to all of them.
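For example, from a second terminal on a Frontera login node (c123-456 is a placeholder; substitute the hostname of the compute node you were allocated):

```bash
ssh c123-456     # placeholder hostname for your allocated compute node
nvidia-smi       # reports utilization and memory use for each of the 4 GPUs
```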
GPU vs CPU: further investigation
If you're interested in exploring further, you could carry out some of the following sorts of tests and exercises.
- If you've forgotten to load the cuda module, or if you were to execute the command module unload cuda, the code would still run, but in cpu-only mode. Because you are running the general cpu+gpu version of tensorflow, you will see lots of warning messages about not being able to locate cuda libraries and initialize GPUs.
- First try running a cpu-only version of TensorFlow by unloading the cuda module on the same rtx-dev node that you allocated previously (see the sketch after this list). Note the slowdown as compared to the GPU-enabled run shown above.
- Next, exit out of your node in the rtx-dev queue, and grab a new node in the CPU-only development queue. Now, run the mnist_convnet code again, and note the time per step during the training phase.
- You will probably observe that the GPU-enabled version on the rtx-dev queue is fastest, followed by the CPU-only version on the development queue, trailed by the missing-cuda version on the rtx-dev queue.
- For this small problem size, the GPU-enabled version is only slightly faster than the CPU-only version on the development queue. The missing-cuda version on the rtx-dev queue is the slowest, since the Broadwell-based CPUs on the rtx-dev queue are less capable than the CLX nodes in the development queue.
- For larger problems and datasets, we would expect to see a substantially greater training speedup in moving from CPUs to GPUs.
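A sketch of the first two exercises follows; the time limit and node count are illustrative, and /path/to/tf282 is a placeholder for wherever you built your environment:

```bash
# 1) cpu-only run on the rtx-dev node you already have:
module unload cuda                     # TensorFlow falls back to cpu-only mode
python3 mnist_convnet.py

# 2) cpu-only run on a CLX node in the development queue:
exit                                   # relinquish the rtx-dev node
idev -p development -N 1 -t 00:30:00   # request a CPU-only node
source /path/to/tf282/bin/activate     # placeholder path: reactivate your environment
python3 mnist_convnet.py
```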