Chris Myers
Cornell Center for Advanced Computing

6/2023 (original)

Distributed training involves using parallelism to distribute the training phase of a deep learning pipeline across multiple computational resources, either to accelerate the training process, to access sufficient memory for a very large model, or both. In this topic, we consider data-parallel training of deep learning models across multiple GPUs, a typical use case that can make effective use of the nodes with multiple attached GPUs on Frontera at TACC.
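
As a concrete preview of data-parallel training, the sketch below uses TensorFlow's MirroredStrategy to replicate a model across all GPUs visible on a single node, splitting each batch among the replicas and combining gradients after every step. The tiny model and synthetic data are placeholders for illustration only, not code from any TACC example.

```python
# Minimal sketch of single-node, multi-GPU data-parallel training
# using TensorFlow's MirroredStrategy. Model and data are placeholders.
import tensorflow as tf

# MirroredStrategy creates one replica of the model per visible GPU
# and aggregates gradients across replicas after each batch.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables (model and optimizer) must be created inside the strategy's
# scope so that they are mirrored across devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Synthetic data stands in for a real dataset; each replica processes
# a shard of every global batch.
x = tf.random.normal((1024, 32))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=64, epochs=2)
```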

Objectives

After you complete this segment, you should be able to:

  • Explain the differences between data-parallel and model-parallel training
  • Discuss differences in the ways that TensorFlow, PyTorch, and Horovod provide support for distributed training

Prerequisites

In order to run programs on TACC systems, you need an allocation of service units on whichever system(s) you choose to use. In addition, if you are unfamiliar with some of the mechanics of working on TACC systems, you might want to consult the topic "Deep Learning at TACC" and the links contained therein.
