CUDA
NVIDIA GPUs, such as those on Frontera at TACC, can be programmed using CUDA, a parallel computing architecture and API that enables NVIDIA GPUs to be used for general-purpose computing. CUDA (short for "Compute Unified Device Architecture") is a rather low-level API for moving data to and from GPUs and launching computations on them. As such, CUDA often serves as the underpinning of higher-level libraries and APIs that need to access NVIDIA GPUs.
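One example of such a higher-level library is CuPy, which exposes NumPy-style arrays backed by CUDA. The sketch below, assuming CuPy is installed and a GPU is present, illustrates the basic CUDA workflow of moving data to the GPU, computing on it there, and copying the result back:

```python
import numpy as np
import cupy as cp  # CuPy: a NumPy-like array library built on top of CUDA

# Create an array in host (CPU) memory, then copy it to GPU memory
x_cpu = np.arange(1_000_000, dtype=np.float32)
x_gpu = cp.asarray(x_cpu)

# The computation below runs on the GPU via CUDA kernels under the hood
y_gpu = cp.sqrt(x_gpu).sum()

# Copy the result back to host memory
print(cp.asnumpy(y_gpu))
```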
The CUDA Toolkit provides a set of libraries supporting a variety of numerical operations on NVIDIA GPUs. Information about the different CUDA versions, along with the component versions that align with each, is available online at https://docs.nvidia.com/cuda/archive/; for any particular CUDA version, the Release Notes linked from that page list the specific component versions.
On systems with CUDA-compatible GPUs, both TensorFlow and PyTorch use the CUDA libraries to accelerate deep learning computations. Every installation of TensorFlow and PyTorch is linked against a specific version of the CUDA toolkit. Furthermore, every version of TensorFlow and PyTorch specifies a minimum version of the CUDA toolkit that it must be linked against in order to support GPU computations. By "minimum", we mean that TensorFlow or PyTorch needs to be built with support for a CUDA version at least as high as the one specified; higher versions are acceptable due to the backward compatibility of CUDA.
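You can inspect the CUDA version that a particular installation was built against directly from Python. The sketch below uses introspection facilities that both frameworks provide (torch.version.cuda and tf.sysconfig.get_build_info()); note that the exact keys in TensorFlow's build-info dictionary can vary somewhat across releases:

```python
import torch
import tensorflow as tf

# CUDA version this PyTorch build was compiled against (None for CPU-only builds)
print("PyTorch built with CUDA:", torch.version.cuda)

# TensorFlow reports its build-time CUDA/cuDNN versions in a dictionary
info = tf.sysconfig.get_build_info()
print("TensorFlow built with CUDA:", info.get("cuda_version"))
print("TensorFlow built with cuDNN:", info.get("cudnn_version"))
```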
CUDA minimum version compatibilities are specified at the following locations:
- TensorFlow-CUDA compatibility: https://www.tensorflow.org/install/source#gpu
- PyTorch-CUDA compatibility: https://pytorch.org/get-started/previous-versions/
TensorFlow and PyTorch take slightly different approaches to CUDA linkage, however. TensorFlow assumes that an available and compatible version of the CUDA toolkit is installed on the system of interest, whereas PyTorch is linked against a specific version of the CUDA toolkit when it is built. On some machines, system administrators ensure proper version compatibility, but that task generally falls to individual users when they do a local installation of one or more deep learning packages. Installing CUDA yourself can be complicated, and it is not particularly practical on shared systems for users without root access.
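Whichever way the linkage is specified, both frameworks offer a quick runtime check of whether a compatible CUDA installation and GPU are actually visible. A minimal sketch:

```python
import torch
import tensorflow as tf

# PyTorch: True only if its CUDA runtime can find a usable GPU
print("PyTorch sees a GPU:", torch.cuda.is_available())

# TensorFlow: returns an empty list if no GPU (or no compatible CUDA) is found
print("TensorFlow sees GPUs:", tf.config.list_physical_devices("GPU"))
```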
On TACC systems, CUDA libraries are managed using the Lmod module system. In addition to CUDA, TensorFlow/Keras and PyTorch also use other NVIDIA libraries, including the NVIDIA CUDA Deep Neural Network library (cuDNN) and the NVIDIA Collective Communication Library (NCCL). On TACC systems, the cuDNN and NCCL libraries are likewise managed through the Lmod module system (with module names in all lower-case, as cudnn and nccl). Each distinct CUDA version has an associated cudnn version and nccl version, and the module system will only let you load compatible versions of each. Fortunately, the module system associates those related versions by manipulating the MODULEPATH environment variable, sparing you from needing to track down version compatibilities for those additional modules.
For example, if you wanted to load the cuda/11.3 module on Frontera, as well as the associated versions of cudnn and nccl, you could execute the following shell command:

```
module load cuda/11.3 cudnn nccl
```
Because the cuda/11.3 module is configured to point to cudnn/8.2.1 and nccl/2.9.9, those specific versions will be loaded, but they do not need to be explicitly identified during the module load. Alternatively, if cuda/11.0 were desired instead, the command

```
module load cuda/11.0 cudnn nccl
```

would load the appropriate versions, which happen to be cudnn/8.0.5 and nccl/2.8.3. If you want, you can specify explicit version numbers for cudnn and nccl, but if either of those conflicts with the specified cuda version, the module system will indicate an error.
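After loading the modules, you can check from Python which cuDNN and NCCL versions a GPU-enabled framework actually sees; the sketch below uses PyTorch's introspection helpers. Bear in mind that pip and conda builds of PyTorch often bundle their own copies of these libraries, so the reported versions may reflect the bundled copies rather than the modules you loaded:

```python
import torch

# cuDNN version visible to this PyTorch build, e.g., 8201 for cuDNN 8.2.1
print("cuDNN version:", torch.backends.cudnn.version())

# NCCL version (a tuple such as (2, 9, 9) in recent PyTorch releases)
print("NCCL version:", torch.cuda.nccl.version())
```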