As noted on the previous page, the edge weights \(\bar W\) and node biases \(\vec b\) form a set of numerical parameters that can be tuned to alter the mapping of inputs to outputs in a neural network. In typical settings, this tuning is accomplished as part of a process of fitting the neural network model to data. One typically specifies a loss function (or, by another name, an objective function) that is minimized over the training data through the tuning of those parameters; model performance is then assessed on a testing or validation set that is held out from the training procedure. In a regression problem, the loss function might be the mean-squared error between the data and the model predictions. In a classification problem, it might be a categorical cross entropy that indicates how well the model predicts the class labels of the data.
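As a concrete illustration, the short sketch below (in Python, using PyTorch, one of the frameworks mentioned later on this page) computes both kinds of loss for a few made-up values; the numbers themselves are placeholders, not data from any real problem.

```python
import torch
from torch import nn

# Regression: mean-squared error between model predictions and data.
predictions = torch.tensor([2.5, 0.0, 2.1])
targets = torch.tensor([3.0, -0.5, 2.0])
mse = nn.MSELoss()(predictions, targets)

# Classification: categorical cross entropy between predicted class scores
# (logits) and the true integer class labels.
logits = torch.tensor([[1.2, 0.3, -0.8],
                       [0.1, 2.0, 0.4]])   # two samples, three classes
labels = torch.tensor([0, 1])              # true class for each sample
ce = nn.CrossEntropyLoss()(logits, labels)

print(mse.item(), ce.item())
```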

Training and Backpropagation

A model is trained — that is, its parameters are estimated from data — by making iterative modifications to the parameter values that generally reduce the loss associated with the current set of parameter values. This is typical of many optimization algorithms in scientific computing, where one wishes to move "downhill" along the loss surface through changes to the parameters. Gradient descent is a standard method for such a process, in which one computes gradients of the loss function with respect to the model parameters in order to identify the appropriate direction in which to move in a high-dimensional parameter space. In deep learning, a modified algorithm — stochastic gradient descent (SGD) — is often used: gradients are estimated from randomly sampled subsets of the data, which reduces the cost of each update and introduces noise that can help the optimization avoid getting stuck in local minima. A variety of different optimizers are available in deep learning software packages, each with different strengths and weaknesses for different types of problems. Algorithms can also differ in how often they update parameter values: subsets of the data are often processed in batches of some prescribed size, with parameter updates computed in the aggregate for each batch.
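The minimal sketch below illustrates the basic idea of moving downhill via gradient descent on a one-parameter loss, loss(w) = (w - 3)^2, whose gradient is 2(w - 3); the loss function, learning rate, and iteration count are arbitrary choices made only for illustration.

```python
# Plain gradient descent on loss(w) = (w - 3)**2, whose gradient is 2*(w - 3).
learning_rate = 0.1
w = 0.0                              # initial parameter value
for step in range(50):
    grad = 2.0 * (w - 3.0)           # gradient of the loss with respect to w
    w = w - learning_rate * grad     # step "downhill" along the loss surface
print(w)                             # approaches the minimizer w = 3
```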

In general, the computation of parametric gradients can be carried out through various means, but for neural networks, a particular approach called backpropagation is used. After a forward pass through the network to map inputs to outputs (given the current set of parameter values), a backward pass is carried out from the outputs to the inputs, accumulating the parametric derivatives along the way as an application of the chain rule from calculus. Once the derivatives are computed, an optimization algorithm can make changes to the parameter values, and the process repeats until a desired level of convergence is achieved or a prescribed number of iterations is reached.
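A minimal sketch of this cycle in PyTorch appears below: each iteration performs a forward pass, a backward pass to accumulate parametric derivatives, and an optimizer update. The model architecture, synthetic data, batch size, learning rate, and epoch count are all illustrative assumptions rather than recommendations.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(512, 20)                  # synthetic inputs
y = torch.randint(0, 3, (512,))           # synthetic labels for 3 classes
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for epoch in range(10):                   # fixed iteration budget for this sketch
    for xb, yb in loader:                 # one parameter update per batch
        optimizer.zero_grad()             # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)     # forward pass: inputs -> outputs -> loss
        loss.backward()                   # backward pass: accumulate derivatives
        optimizer.step()                  # optimizer adjusts the parameter values
```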

Automatic differentiation

Automatic differentiation is a powerful computational technique that enables not just the execution of a sequence of operations on data, but also the computation of gradients of the results of those operations with respect to parameters. This is accomplished by recording the sequence of operations applied to data, and by applying the chain rule consistent with that sequence of operations to compute parametric derivatives. Automatic differentiation is used to compute parametric derivatives during backpropagation, and this operation is supported within deep learning software. In TensorFlow (tf), for example, the tf.GradientTape API provides the key support for computation of gradients using automatic differentiation. In PyTorch, torch.autograd provides the key functionality of this sort.
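The short sketch below shows both of these APIs computing the same derivative, d(x^3)/dx = 3x^2, at x = 2; the function being differentiated is just an illustrative choice.

```python
import tensorflow as tf
import torch

# TensorFlow: record operations on a tape, then ask for gradients.
x_tf = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y_tf = x_tf ** 3
print(tape.gradient(y_tf, x_tf))      # 3 * x^2 = 12.0

# PyTorch: mark a tensor as requiring gradients; torch.autograd
# accumulates the derivative into x_pt.grad during backward().
x_pt = torch.tensor(2.0, requires_grad=True)
y_pt = x_pt ** 3
y_pt.backward()
print(x_pt.grad)                      # 3 * x^2 = 12.0
```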

Hyperparameters

Broadly speaking, in constructing and training a model, one specifies a network architecture, a loss function, an optimization algorithm, and a dataset to be learned from. During training, the parameters of the neural network are tuned using the procedure outlined above. In addition, however, there are typically hyperparameters that are not estimated during training but which are specified outside of the training process, either by the user as a design choice or by a separate algorithm that aims to compute optimal hyperparameter values. Examples of hyperparameters might include the prescribed batch size for optimizer updates, or the learning rate that indicates how big a step to take downhill once gradients are computed. Hyperparameters can be optimized through various means, mostly involving the use of cross-validation, where the model performance can be assessed for different values of the hyperparameters. Such an optimization represents a type of "outer loop" of a computation, since for each set of hyperparameter values, the "inner loop" of parameter training via backpropagation must be carried out. Given that the inner loop of training can itself be rather computationally expensive for large datasets, a full computation to optimize hyperparameters can require a large amount of computational resources, making it suitable for high-performance systems such as those supported at TACC.
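As an illustration of this nested structure, the sketch below wraps a small PyTorch training run (the inner loop) inside a grid search over learning rate and batch size (the outer loop). The synthetic data, the tiny model, the grid values, and the train_and_validate helper are all assumptions made for the sake of the example.

```python
from itertools import product
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data, split into training and validation portions.
X, y = torch.randn(400, 10), torch.randn(400, 1)
train_ds = TensorDataset(X[:320], y[:320])
val_ds = TensorDataset(X[320:], y[320:])

def train_and_validate(learning_rate, batch_size):
    """Inner loop (hypothetical helper): train a small model with the given
    hyperparameters, then report the loss on the held-out validation data."""
    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    for epoch in range(5):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()
    with torch.no_grad():
        Xv, yv = val_ds.tensors
        return loss_fn(model(Xv), yv).item()

# Outer loop: each hyperparameter combination triggers a full training run.
best = None
for lr, bs in product([0.1, 0.01, 0.001], [16, 32, 64]):
    val_loss = train_and_validate(lr, bs)
    if best is None or val_loss < best[0]:
        best = (val_loss, lr, bs)
print("best validation loss %.4f at lr=%g, batch_size=%d" % best)
```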

One wants to tune the parameters of a neural network to optimize the mapping of inputs to outputs by learning from data. A classification problem such as the recognition of hand-written digits, for example, might involve a large number of images of digits along with their associated labels (e.g., this image is an example of a "2", and that image is an example of an "8"). The goal of training is to tune the parameters of a neural network model so that it correctly classifies images of digits that it has not yet seen.

 