
C/R Background

Brandon Barker
Cornell Center for Advanced Computing

3/2016 (original)

Losing results midway through a program's execution because of a crash or service outage can be a minor annoyance, but it can also cause large setbacks or be extremely costly. Checkpoint/Restart (C/R), sometimes called Checkpoint and Restart (CaR) or simply checkpointing, addresses this problem. C/R involves saving the state of a running application so that it can be restored later. Broadly speaking, an application can be anything from a single process to a program spanning multiple virtual machines. C/R can be done in an ad-hoc fashion, but there are also several types of full C/R solutions. These either use an external tool to save a program's entire state, or use custom code or libraries to save only the required components of program state.
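To make the custom-code approach concrete, here is a minimal sketch in Python of a loop that periodically saves only the state it needs and resumes from the last saved state when restarted. It is an illustration under stated assumptions, not a recommended implementation: the function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the stop_after parameter used to simulate an interruption are all hypothetical choices made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically, so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before renaming
    os.replace(tmp_path, path)  # atomic rename over any previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Sum the squares of 0..n_steps-1, checkpointing as it goes.

    stop_after is a hypothetical knob that simulates a crash: the function
    returns None at that step without saving the work done since the last
    checkpoint.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if stop_after is not None and step >= stop_after:
            return None  # simulated interruption
        state["total"] += step * step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling run("sum.ckpt", 1000, stop_after=250) and then run("sum.ckpt", 1000) mimics a job that is interrupted and later restarted. The second call picks up at step 200, the last checkpoint, rather than step 0, so only the 50 steps after that checkpoint are recomputed. Writing to a temporary file and renaming it is one common way to avoid leaving a half-written checkpoint behind if the program dies during the save.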

Fault tolerance is especially valuable for the long-running codes typical in high performance computing (HPC), but HPC also presents unique challenges to C/R implementers. HPC almost always involves concurrent programming, and shared-memory and distributed-memory applications each bring their own considerations. Recently, robust C/R solutions have begun to emerge to tackle these areas. This is great news, since concurrent computing is used in the jobs most likely to benefit from C/R: massive, long-running jobs where the cost of lost work is very large in both time and money.

In this topic we focus on process-level C/R, which, along with application-level C/R, is currently one of the most useful types of C/R for HPC. We also give an overview of related technologies, such as virtual machines and containers, and discuss use cases and guidelines for each type of C/R or C/R-related technology.

Objectives

After you complete this segment, you should be able to:

List use cases for Checkpoint/Restart
List situations when a C/R solution is helpful
List potential problems with C/R
Name the pros and cons of using ad-hoc C/R

Prerequisites

A basic understanding of Linux and of some programming language is the only requirement for this topic. For instance, you may be accustomed to doing analyses in R or MATLAB. The principles taught in this topic will still be useful; most of the techniques require very little programming experience.

Additional Checkpoint/Restart references may be helpful but are not required for this roadmap.