C/R Background

Brandon Barker
Cornell Center for Advanced Computing

3/2016 (original)

Losing results partway through a program's execution because of a crash or service outage can be a minor annoyance, but it can also cause large setbacks or be extremely costly. Checkpoint/Restart (C/R), sometimes called Checkpoint and Restart (CaR) or simply checkpointing, addresses this problem by saving the state of a running application so that it can later be restored. Broadly speaking, an application could be a single process or could span multiple virtual machines. C/R can be done in an ad-hoc fashion, but there are also several types of full C/R solutions. These solutions either use an external tool to save an entire program's state or use custom code or libraries to save only the required components of program state.

A diagram of the checkpoint/restart process, in which a computation produces checkpoints as it executes, is interrupted, and is then restarted from one of the checkpoints. The following icons from https://thenounproject.com/ are used in the diagram and are distributed under the Creative Commons Attribution License 3.0: Happy, by Julien Deveaux; Dead, by Julien Deveaux; Upload, by Philipp Süß; Save, by Jevgeni Striganov.
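
The custom-code approach mentioned above, in which a program saves only the state it needs, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the file name checkpoint.json, the function names, the toy sum-of-squares computation, and the fail_at parameter that simulates a crash. The state is written to a temporary file and then renamed into place, so an interruption during a save should not corrupt the previous checkpoint.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before renaming
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(n_steps, checkpoint_every=100, path=CHECKPOINT_PATH, fail_at=None):
    """Sum the squares of 0..n_steps-1, checkpointing periodically.

    fail_at simulates a crash at the given step (demonstration only).
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated crash at step {step}")
        state["total"] += step * step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # record the final state
    return state["total"]
```

If run(1000, fail_at=250) is interrupted, a second call to run(1000) resumes from the checkpoint written at step 200 instead of from step 0, so only the work done since that checkpoint is repeated. The checkpoint interval is a trade-off: saving more often costs more I/O, while saving less often means more lost work after a failure.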

Fault tolerance is especially valuable for the long-running codes typical in high performance computing (HPC), but HPC also presents unique challenges to C/R implementers. HPC almost always involves concurrent programming, and there are considerations for both shared-memory and distributed-memory applications. Recently, robust C/R solutions have begun to emerge to tackle these areas. This is great news, since concurrent applications are the jobs most likely to benefit from C/R: massive, long-running jobs where the cost of lost work is very large in both time and money.

In this topic we focus on process-level C/R, which, along with application-level C/R, is currently one of the most useful types of C/R for HPC. We also give an overview of related technologies, such as virtual machines and containers, and discuss use cases and guidelines for each type of C/R or C/R-related technology.

Objectives

After you complete this segment, you should be able to:

  • List use cases for Checkpoint/Restart
  • List situations when a C/R solution is helpful
  • List potential problems with C/R
  • Name the pros and cons of using ad-hoc C/R

Prerequisites

A basic understanding of Linux and of some programming language is the only requirement for this topic. For instance, you may be accustomed to doing analyses in R or MATLAB. Even so, the principles taught in this topic will still be useful, and most of the techniques require very little programming experience.

Additional Checkpoint/Restart references may be helpful but are not required for this roadmap.

 
© Cornell University | Center for Advanced Computing