When running a job for any substantial length of time, you hope that nothing interrupts it so that you won't have to start over. But no matter the environment, interruption is always a possibility. Checkpoint/Restart (C/R) solutions allow you to resume a job from approximately where it left off. Checkpointing strategies also offer benefits beyond fault tolerance, so learning about them is likely to help your own work.

This roadmap provides an overview of the checkpoint/restart strategy and a survey of different types of C/R solutions.

Objectives

After you complete this roadmap, you should be able to:

  • List the various types of C/R solutions
  • Explain when to use a specific C/R solution
  • Identify the software commonly used for C/R

Prerequisites

A basic understanding of Linux and of at least one programming language is the only requirement for the first half of this roadmap. Even if you are mainly accustomed to doing analyses in an environment such as R or MATLAB, the principles taught in this roadmap will still be useful; most of the techniques require very little programming experience.

Additional Checkpoint/Restart references may be helpful but are not required for this roadmap.

Requirements

There are no specific requirements for this roadmap.

©   Cornell University  |  Center for Advanced Computing