Hardware Considerations
Deep learning methods can be executed on any general-purpose computer, but they are particularly well suited to the hardware acceleration provided by graphics processing units (GPUs), as well as by related, more specialized hardware targeted specifically at deep learning. This is because the computational actions of neural networks — multiplying node outputs by edge weights and adding biases or offsets, either in forward or backward propagation through the network — are linear algebra operations that can be parallelized over many data streams in a large network. Fundamentally, these are the kinds of operations that GPUs were originally built to support, albeit in the context of processing graphics on a computer display. Compared to central processing units (CPUs), GPUs have many more processing units and higher memory bandwidth, while CPUs have more complex instruction sets and faster clock speeds. These issues are discussed in our companion material on GPU architectures — see the section on GPU characteristics for more detail on the differences between GPUs and CPUs. Since the training phase of machine learning can involve many iterated forward and backward passes through the network in order to fit a large number of training samples, GPUs can be especially important during training.
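To make this concrete, the minimal sketch below expresses one layer's forward propagation as a matrix multiply plus a bias vector, using PyTorch. The layer and batch sizes are arbitrary choices for illustration, and the same code runs unchanged on a CPU or a GPU.

```python
# A minimal sketch of the core linear-algebra operation in one neural network layer:
# multiplying a batch of inputs by a weight matrix and adding a bias vector.
import torch

batch_size, n_in, n_out = 64, 1024, 512   # arbitrary sizes for illustration
x = torch.randn(batch_size, n_in)         # node outputs from the previous layer
W = torch.randn(n_in, n_out)              # edge weights
b = torch.randn(n_out)                    # biases (offsets)

# Forward propagation through one layer: a matrix multiply plus a bias,
# which a GPU can parallelize across all rows and columns at once.
z = x @ W + b
print(z.shape)   # torch.Size([64, 512])
```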
Popular deep learning frameworks — such as the TensorFlow/Keras and PyTorch packages that we will examine here — can run solely on CPUs, or they can make use of one or more attached GPUs if available. If a GPU is detected, it will be used for the specific numerical operations that can exploit that hardware. Unless specifically instructed, however, neither TensorFlow/Keras nor PyTorch will automatically make use of multiple GPUs attached to a single CPU, a configuration found on the Frontera system at TACC. If multiple GPUs are available, they can be accessed through various interfaces that support distributed training, which will be discussed in more detail in a later section. Very large jobs can further leverage multiple CPU workers, each with multiple attached GPUs, for even greater parallelism.
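As a quick check of what hardware your code will actually use, the short sketch below asks each framework whether it can see a GPU and then moves a PyTorch tensor onto the chosen device. It assumes TensorFlow 2.x and PyTorch are both installed; the tensor dimensions are arbitrary.

```python
# Minimal sketch: checking for available GPUs and placing work on one device.
import tensorflow as tf
import torch

# TensorFlow reports the GPUs it can see; if any are present, Keras operations
# will run on the first one by default.
gpus = tf.config.list_physical_devices('GPU')
print(f"TensorFlow sees {len(gpus)} GPU(s)")

# PyTorch uses explicit device placement: tensors and models are moved
# to a chosen device with .to(device).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"PyTorch will use: {device}")

batch = torch.randn(32, 100).to(device)   # a batch of data moved to the selected device
```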
As with any computation, the actual performance of your deep learning code will depend on many factors, including whether data and computational work can be fed to the processing units quickly enough to keep them busy. The ability to monitor resource utilization is therefore important, so that you can make effective use of the resources you have been allocated. If you are working with small datasets, such as the ones we use as examples in this tutorial, you might not require the computational power of multi-GPU nodes such as those at TACC. To use such resources efficiently, you might want to explore other forms of parallelism, such as running multiple single-GPU jobs concurrently on an allocated multi-GPU node; or your work might not require a GPU at all. In general, you should experiment with different problem sizes and configurations, monitor the performance of your code, document your findings, and cultivate an empirical perspective to understand what resources your programs need and how best to optimize the time and resources required to answer your research questions of interest.
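One simple way to start monitoring is to check GPU memory use from inside your own program, as in the sketch below. The report_gpu_memory helper is purely illustrative, not part of any library; from the command line, the nvidia-smi utility provides a fuller picture of per-GPU utilization and memory on a node.

```python
# A minimal sketch of reporting GPU memory use from within a PyTorch program.
import torch

def report_gpu_memory(device_index=0):
    """Print how much GPU memory this process has allocated and reserved."""
    if not torch.cuda.is_available():
        print("No GPU available; running on CPU")
        return
    allocated = torch.cuda.memory_allocated(device_index) / 1e9   # bytes -> GB
    reserved = torch.cuda.memory_reserved(device_index) / 1e9
    name = torch.cuda.get_device_name(device_index)
    print(f"{name}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")

# Call this periodically (e.g., once per training epoch) to see how
# memory use grows with your model and batch size.
report_gpu_memory()
```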
Once trained, a deep learning model can potentially be run on much less powerful hardware (including, say, a mobile phone) to make predictions from new data.