A deep learning model consists of everything needed to make predictions from data: a neural network topology, prescribed activation functions at each node, and numerical parameters representing edge weights and node biases. (Other aspects of the deep learning process, such as optimizers, parametric derivatives, and hyperparameters, are involved in the training of a model but are not needed once the parameter values have been set.) Much effort, both intellectual and computational, goes into training neural networks, but once that training is complete, the model can be applied to make predictions about unseen data. This process of applying a trained ML model to new data to make predictions is sometimes called inference, although that term can be somewhat confusing, since in other areas of statistics and parameter estimation "inference" can refer to the process of estimating models or parameters from data. Generally, training a model to fit its parameters is more time-consuming and computationally expensive than applying the trained model to make predictions about new data. This is because training is an iterative process that makes repeated forward and backward passes through the network, whereas making a prediction about a new data input typically requires just a single forward pass through the network.
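To make that distinction concrete, the following minimal PyTorch sketch contrasts a single training step (forward pass, loss computation, backward pass, parameter update) with inference on new data (a single forward pass, with no gradients needed). The tiny network and the random inputs and labels are purely hypothetical stand-ins for illustration.

```python
import torch
import torch.nn as nn

# A tiny network used only for illustration (hypothetical architecture)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 10)          # stand-in batch of 8 inputs
y = torch.randint(0, 2, (8,))   # stand-in class labels

# One training step: forward pass, loss, backward pass, parameter update
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()

# Inference on new data: a single forward pass, no gradients required
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 10)).argmax(dim=1)
```

In practice, the training step above would be repeated over many batches and epochs, which is precisely why training dominates the computational cost.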

A good model that has been trained well will be successful at generalization, that is, at making predictions about data that it was not trained on. In some cases, poor generalization is due to overfitting, where model parameters are trained to learn features (often noisy features) that are particular to a specific dataset. Although neural networks used in deep learning can contain many parameters (which can sometimes lead to overfitting), these networks often show surprisingly good generalization. Research is ongoing among many groups to better understand the geometric nature of the high-dimensional parameter spaces underlying deep neural networks, and the processes of parameter estimation that lead to effective generalization of trained models.

Once trained, a model can be distributed to others, or transferred to other platforms, for application to other datasets. There are a variety of formats for storing trained models so that they can be applied to data on various platforms, but we will not focus so much on that aspect here.

Much research involves the use of large and complex models already trained on massive datasets, and distributed to the broader community for reuse. This is particularly the case for image-based models used for computer vision and image classification, and for Large Language Models (LLMs) used in a variety of natural language processing (NLP) tasks. Popular image classification models include VGG-16 and ResNet50, and popular language models and applications include GPT-3, GPT-4, ChatGPT, and BERT. In addition to being trained on very large datasets, these models also include architectural elements specific to the problem domain under consideration. Image classification models, for example, are often implemented as Convolutional Neural Networks (CNNs) that make heavy use of convolutional layers and associated elements such as pooling layers to capture image features in an efficient and translation-invariant manner. Many LLMs use techniques such as attention mechanisms and transformer architectures to compute representations relating input to output sequences with long-range structural associations. If you want to make use of a specific pre-trained model, you will want to investigate which deep learning framework(s) provide access to that model. In general, it seems that PyTorch provides more comprehensive access to large, pre-trained models than does TensorFlow/Keras, but that might vary by application domain.
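As one illustration of using a pre-trained model, the sketch below loads ResNet50 with ImageNet weights through torchvision (assuming a recent version of torchvision that supports the weights argument) and makes a prediction with a single forward pass. The random tensor is a stand-in for a real, preprocessed 224x224 RGB image.

```python
import torch
import torchvision

# Load a pre-trained ResNet50; the weights are downloaded on first use
model = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.DEFAULT
)
model.eval()  # inference mode: disables dropout, uses stored batch-norm statistics

# Stand-in for a preprocessed image batch of size 1 (3 channels, 224x224 pixels)
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():          # no gradients needed for inference
    logits = model(x)
predicted_class = logits.argmax(dim=1)
```

TensorFlow/Keras provides access to some of the same image classification models through its keras.applications module, so the choice of framework is often a matter of which models and tooling you need.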

While some pre-trained models can be used as-is to make predictions about new data, sometimes it is worthwhile or necessary to incrementally fine-tune or re-train an existing model. Perhaps there is a new set of training data that would be good to have the model incorporate, or a slightly different problem formulation to which an existing model needs to be adapted. In such cases, rather than starting from scratch and re-training the full model, it makes more sense to begin from the existing parameterization and have training proceed from there. Not all model parameters need to be subjected to re-training, and software systems for deep learning allow one to specify which layers are trainable (and which are frozen at their current parameterization). If you are adding new network layers to an existing model, for example, to modify the outputs of the model, you might choose to make only these new layers trainable.
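Continuing with the hypothetical torchvision ResNet50 from above, the following sketch freezes all of the existing parameters and replaces the final classification layer with a new, trainable layer for a hypothetical 5-class problem; only the new layer's parameters are handed to the optimizer, so re-training updates just that layer.

```python
import torch
import torch.nn as nn
import torchvision

# Start from a pre-trained model (same hypothetical setup as above)
model = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.DEFAULT
)

# Freeze all existing parameters so they are not updated during re-training
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with a new, trainable layer
# for a hypothetical 5-class problem (the new layer's parameters are
# trainable by default)
model.fc = nn.Linear(model.fc.in_features, 5)

# Pass only the new layer's parameters to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

The same idea applies in TensorFlow/Keras, where individual layers expose a trainable attribute that can be set to False to freeze them.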
