Machine learning (ML) involves the use of algorithms that can learn about patterns and structure in data, without being specifically instructed about the details of those patterns. Deep learning (DL) is generally considered to be a subfield of machine learning, in that it focuses more specifically on the use of neural networks (sometimes referred to as Artificial Neural Networks, or ANNs) to solve problems in ML. All these fields mix and mingle with elements of the broadly defined field of Data Science, although much of data science involves the human-guided — rather than machine-guided — processing of data. Machine learning is an umbrella that comprises many different types of problems, and many different types of algorithms designed to solve those problems.

Very broadly speaking, ML and DL aim to "learn" how to map a set of inputs to a set of outputs — typically via repeated iteration through data with some sort of feedback to guide the learning. The inputs are often data that are presented to us, and the outputs involve predictions or characterizations that we want to make about those data. In this sense, by taking inputs and producing outputs, ML/DL "machines" are like the computational functions that we write in software or the mathematical functions that we use in analysis, but they are not constructed with explicit instructions about what sorts of functions to implement. Rather, ML/DL machines are built with flexible and expressive computational elements that contain parameters which can be modified to produce different mappings between inputs and outputs. At its core, learning in ML and DL is about modifying or "fitting" these parameters in order to produce useful functional mappings.
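To make this concrete, here is a minimal sketch of "learning as parameter fitting": a one-parameter linear model y = w·x, fit by gradient descent on mean-squared error. The data, learning rate, and number of steps are all illustrative choices, not from any particular library or dataset.

```python
# Tiny illustration: fit the single parameter w of the model y = w * x
# by gradient descent on the mean-squared error.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x, with a little noise

w = 0.0                      # the single adjustable parameter
lr = 0.01                    # learning rate (a hyperparameter)

for step in range(500):
    # gradient of mean((w*x - y)^2) with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad           # take a small step downhill

print(round(w, 2))           # w should end up close to 2.0
```

The repeated loop with feedback (the gradient of the loss) is exactly the "repeated iteration through data" described above, just in its simplest possible form.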

When we speak about flexible and expressive computational elements, we are referring to computational objects that can be configured and parameterized in different ways to capture a variety of different functional relationships, as compared to a particular object that encodes a specific functional relationship between inputs and outputs. Some commonly used examples include:

  • Decision trees
  • Random forests
  • Support vector machines
  • Neural networks

Machine learning comprises a variety of different types of problems, such as those involving supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning

Supervised learning involves data that are labeled, with the aim of training a system to develop a mapping from the underlying data elements to their associated labels, so that predictions about new and unseen data can be made based on this mapping. Supervised learning divides generally into classification — if the data labels represent discrete categorical classes — and regression — if the data labels represent continuous numerical values. Classification involves problems such as identifying letters and digits in images of handwritten text, or distinguishing cancerous from normal cells based on their gene expression patterns. Regression involves problems such as predicting crop yield based on climate and soil conditions, or predicting stock returns based on prior performance and other economic factors. Semi-supervised learning combines labeled data (which are useful for training, but sometimes difficult to acquire in sufficient quantity) with unlabeled data (which might be more plentiful, and can help reveal patterns in the data relevant to the training process).
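As a toy illustration of supervised classification, consider a 1-nearest-neighbor classifier on labeled 2-D points: each new point is assigned the label of the closest training example. The points and labels below are made up for illustration.

```python
# Toy supervised classifier: 1-nearest-neighbor on labeled 2-D points.

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]

def predict(point):
    # return the label of the closest training example
    # (squared Euclidean distance)
    def dist2(p, q):
        return (p[0] - q[0])**2 + (p[1] - q[1])**2
    return min(train, key=lambda ex: dist2(ex[0], point))[1]

print(predict((1.1, 0.9)))   # near the "A" cluster
print(predict((4.1, 3.9)))   # near the "B" cluster
```

The labeled training set plays exactly the role described above: the mapping from inputs to labels is derived from labeled examples, then applied to unseen points.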

Unsupervised learning

Unsupervised learning involves data that are not labeled, with the aim of discovering patterns inherent in the data themselves. Such a discovery process is a bit more open-ended in practice: it might involve methods such as clustering — in order to identify subgroups of related items within a large dataset — or dimensionality reduction — to discover a lower-dimensional subspace or representation in which some high-dimensional dataset lies. A powerful set of techniques involves the use of autoencoders, which aim to generate efficient representations (encodings) of unlabeled data. These encodings can be useful both for streamlining aspects of data processing and for providing insight into the core structure of a large dataset, by distilling its key aspects into a smaller, lower-dimensional description.
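Clustering is perhaps the easiest of these ideas to sketch. Below is a bare-bones k-means loop (k = 2) on 1-D data: no labels are provided, yet the two subgroups emerge from the data alone. The data values and initial centers are illustrative.

```python
# Bare-bones k-means clustering (k = 2) on 1-D data: alternate between
# assigning points to their nearest center and moving each center to
# the mean of its assigned points.

data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers = [0.0, 10.0]        # illustrative initial guesses

for _ in range(10):
    # assignment step: attach each point to its nearest center
    clusters = [[], []]
    for x in data:
        i = 0 if abs(x - centers[0]) < abs(x - centers[1]) else 1
        clusters[i].append(x)
    # update step: move each center to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]

print([round(c, 1) for c in centers])   # → [1.0, 8.1]
```

Note that no labels were supplied anywhere: the subgroup structure is "discovered" purely from the geometry of the data, which is the essence of unsupervised learning.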

Reinforcement learning

Reinforcement learning involves a system that changes its behavior over time, by getting feedback about what works well or does not, for a given task. Feedback is in the form of positive rewards or negative penalties associated with an agent's behavior in an environment. Strategies or configurations that work well to achieve the task are kept and further modified, while those that do not perform well tend to be discarded. A computer that learns to play games such as chess or Go might do so through a process of reinforcement learning, starting off by playing poorly and losing consistently, but then gradually getting positive feedback about useful strategies in the form of better scores, such that those useful strategies get encoded into the computational model used by the program. (A self-driving car that learned to drive through reinforcement learning might get into lots of accidents and receive a lot of traffic tickets at the start, but hopefully that learning process would be carried out fully in software before being deployed to actual self-driving hardware.)
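A minimal flavor of this feedback loop is a two-armed bandit: an agent repeatedly chooses between two slot-machine "arms" and learns, from reward feedback alone, which one pays better. The payout probabilities, exploration rate, and number of steps below are all made-up illustrative values.

```python
# Epsilon-greedy bandit: learn which of two "arms" pays better,
# purely from reward feedback.

import random
random.seed(0)

payout = [0.2, 0.8]          # true success probability of each arm (hidden)
value = [0.0, 0.0]           # the agent's running reward estimate per arm
count = [0, 0]

for step in range(2000):
    # mostly exploit the best-looking arm, occasionally explore
    if random.random() < 0.1:
        arm = random.randrange(2)
    else:
        arm = 0 if value[0] > value[1] else 1
    reward = 1.0 if random.random() < payout[arm] else 0.0
    count[arm] += 1
    value[arm] += (reward - value[arm]) / count[arm]   # incremental mean

print(value)   # the agent should come to prefer arm 1
```

The "strategies that work well are kept" idea shows up here as the running value estimates: the arm that earns more reward is chosen more often, while the poorly performing arm is mostly abandoned (apart from occasional exploration).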

Within the confines of these broad classes of problems, there are a number of different algorithms that can be employed to carry out such learning, each with their own assumptions, biases, strengths and weaknesses. In addition, there are a number of different software tools that support different ML and DL algorithms.

Across the landscape of ML and DL algorithms and internal representations, there are several recurring concepts worth highlighting.

Parameters

Almost all ML or DL algorithms contain numerical parameters that are learned or fit during the training process. The architecture of an algorithm lays out the space of possible model behaviors, but the specific numerical values of the model parameters dictate the particular response or output of a model to particular inputs. In a decision tree, for example, the parameters might be the numerical thresholds used to decide which branch of a tree to descend, whereas in a neural network, the parameters would consist of all the edge weights and node biases that govern how information propagates through the network. The process of training a model usually involves iteratively refining the numerical values of model parameters to fit the model to data as well as possible, by minimizing a specified loss function (more below).
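The decision-tree case can be sketched with a "stump" — a tree with a single split — whose one learned parameter is the threshold t. Here the threshold is chosen by a simple search over candidate values to maximize accuracy; the data are illustrative.

```python
# A decision stump (one-split decision tree): the single learned
# parameter is the threshold t. Pick t by searching candidate values
# for the best accuracy on labeled data.

data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

def accuracy(t):
    # fraction of examples where "x > t" matches the label
    return sum((x > t) == bool(y) for x, y in data) / len(data)

candidates = [x + 0.5 for x, _ in data]     # midway-style candidate splits
best_t = max(candidates, key=accuracy)
print(best_t, accuracy(best_t))             # a threshold between the groups
```

The architecture (one split on one feature) fixes the space of possible behaviors; the learned value of t picks out one particular mapping from inputs to labels, exactly as described above.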

Hyperparameters

In addition to model parameters that are learned during training, there are typically hyperparameters that are not estimated during training but which are specified outside of the training process, either by the user as a design choice or by a separate algorithm that aims to compute optimal hyperparameters. In deep learning neural networks, examples of hyperparameters might include the prescribed batch size for optimizer updates, or the learning rate that indicates how big a step to take downhill once gradients of the loss function are computed. Hyperparameters can be optimized through various means, many involving the use of cross-validation, where the model performance can be assessed for different values of the hyperparameters. Such an optimization represents a type of "outer loop" of a computation, since for each set of hyperparameter values, the "inner loop" of parameter training must be carried out. Given that the inner loop of training can itself be rather computationally expensive for large datasets, a full computation to optimize hyperparameters can require a large amount of computational resources.
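The outer-loop/inner-loop structure can be sketched directly. Below, the outer loop tries several candidate learning rates; for each one, the inner loop trains a one-parameter model, and the result is scored on a held-out validation set. The data, candidate values, and step counts are all illustrative (0.3 is deliberately too large, so that training diverges).

```python
# Hyperparameter search as an "outer loop": for each candidate learning
# rate, run the inner training loop, then score on held-out validation
# data.

train_xy = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
val_xy   = [(4.0, 8.1), (5.0, 9.9)]

def train(lr, steps=200):
    # inner loop: fit the single parameter w of y = w * x
    w = 0.0
    for _ in range(steps):
        g = sum(2 * (w * x - y) * x for x, y in train_xy) / len(train_xy)
        w -= lr * g
    return w

def val_loss(w):
    return sum((w * x - y)**2 for x, y in val_xy) / len(val_xy)

best_lr = min([0.001, 0.01, 0.3], key=lambda lr: val_loss(train(lr)))
print(best_lr)   # 0.001 underfits in 200 steps; 0.3 diverges
```

Even in this toy setting, each evaluation of a hyperparameter candidate requires a full run of the inner training loop — which is why hyperparameter optimization on real models and datasets can be so computationally expensive.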

Loss functions

To fit a model to data, one typically specifies a loss function (or, by other names, an objective function or a cost function) that is minimized through the numerical tuning of the model parameters. A loss function encodes, at least in part, what constitutes a "good" model, or a model that "fits the data well". In a regression problem, the loss function might be the mean-squared error between the data and the model predictions. In a classification problem, it might be a categorical cross-entropy that indicates how well the model is predicting the class labels of data in the training set.
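Both of the loss functions mentioned above are simple to write out directly. The predictions and targets below are illustrative values chosen for the sketch.

```python
# The two loss functions mentioned above, computed directly.
import math

# Mean-squared error for a regression problem
y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.2]
mse = sum((p - t)**2 for p, t in zip(y_pred, y_true)) / len(y_true)

# Categorical cross-entropy for one 3-class example: one-hot target,
# predicted class probabilities summing to 1
target = [0, 0, 1]
probs  = [0.1, 0.2, 0.7]
xent = -sum(t * math.log(p) for t, p in zip(target, probs))

print(round(mse, 3), round(xent, 3))   # → 0.02 0.357
```

Note how each loss rewards a different notion of "fitting well": the MSE penalizes numerical distance from continuous targets, while the cross-entropy penalizes assigning low probability to the correct class.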

Training, validation, and testing

Training, validation, and testing represent different phases of the machine learning process. Training is the process of tuning the parameters of a model in order to minimize the loss function, that is, to get the model to fit the data as well as possible, based on whatever criterion is specified in the loss function. In order to carry out training, a subset of available data are used to constitute a training set. Validation is the process of working to ensure that a trained model is generalizable, such that a model does not just learn how to represent the data it has been trained on, but the broader space of likely data. This is achieved by holding out a subset of available data — a validation set — for evaluating model performance during the process of training. Finally, testing is the process of assessing the performance of a trained model, by evaluating the quality of the predictions of the model on a withheld subset of available data referred to as the testing set (or test set), where quality generally reflects the criterion specified in the loss function.
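A common way to produce these three subsets is to shuffle the available data and slice off fixed fractions. The 70/15/15 split below is a typical but arbitrary design choice, and the integer "dataset" is just a stand-in for real labeled examples.

```python
# Partition available data into training, validation, and testing
# subsets (70/15/15 here; the fractions are a design choice).
import random

data = list(range(100))          # stand-in for 100 labeled examples
random.seed(0)
random.shuffle(data)             # shuffle before splitting

n_train = int(0.70 * len(data))
n_val   = int(0.15 * len(data))

train_set = data[:n_train]
val_set   = data[n_train:n_train + n_val]
test_set  = data[n_train + n_val:]

print(len(train_set), len(val_set), len(test_set))   # → 70 15 15
```

The essential property is that the three subsets are disjoint: the validation and test examples never influence parameter fitting, which is what makes them useful for judging generalization.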

Models and inference

A machine learning model consists of everything needed to make predictions from data: a model structure (e.g., a decision tree, or a neural network topology and prescribed activation functions at each node), as well as all the numerical parameters that have been tuned during training. (Hyperparameters, which have been used to provide context to the training process, are not part of the resulting model itself.) Once trained, a model can be applied to make predictions about unseen data. This process of applying a trained ML model to new data to make predictions is sometimes called inference, although that term is somewhat confusing since in other areas of statistics and parameter estimation, "inference" can refer to the process of estimating models or parameters from data. Generally, the training of a model to fit parameters is more time-consuming and computationally expensive than the process of applying the model to make predictions about new data. But for generative AI applications capable of producing copious content, the process of inference or content generation is also becoming increasingly computationally expensive.

A good model that has been trained well will be successful at generalization, that is, in making predictions about data that it was not trained on. In some cases, poor generalization is due to overfitting, where model parameters are trained to learn features (often noisy features) in data that are particular to a specific dataset, rather than learning general trends that might be more broadly applicable.

Once trained, a model can be distributed to others, or transferred to other platforms, for application to other datasets. There are a variety of formats for storing trained models so that they can be applied to data on various platforms, but we will not focus on that aspect here.

© 2025  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Access Statement