Types of C/R

Brandon Barker
Cornell Center for Advanced Computing

Revisions: 1/2023, 3/2016 (original)

This topic presents several categories of C/R solutions and identifies specific examples of software in each category. Note that the categories are not always mutually exclusive; there can be overlap. C/R solutions evolve over time; for a preview of potential future directions, follow the efforts to establish a Checkpoint-Restart Interface Standard for HPC.

No matter what kind of C/R you end up using, please test it first! Preferably, do this on a small dataset, as there will likely be some initial kinks that are easy to work out at that scale. You don't want to start 1000 processes or wait hours only to find out that your checkpoint causes a program crash or, even worse, that your restore doesn't work when you need it to!

Objectives

After you complete this segment, you should be able to:

  • Distinguish among the types of C/R and their pros and cons
  • Relate C/R solutions to different levels of software components

Prerequisites

A basic understanding of Linux and of some programming language is the only requirement for this topic. For instance, you may be accustomed to doing analyses in R or MATLAB. The principles taught in this topic will still be useful; most of the techniques require very little programming experience.

Additional Checkpoint/Restart references may be helpful but are not required for this roadmap.

 
©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement