

Introduction

When running a job that takes any substantial length of time, you hope that nothing interrupts it so that you won't have to start over. But no matter the environment, interruption is always a possibility. Checkpoint/Restart (C/R) solutions allow you to resume a job from approximately where it left off. C/R solutions may either use an external tool to save an entire program's state, or use custom code or libraries to save only the required components of program state.
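To make the second approach concrete, here is a minimal sketch of application-level checkpointing in Python. It is only an illustration, not a method prescribed by this module: the file name `cr_sketch.ckpt`, the helper names `save_checkpoint`, `load_checkpoint`, and `run_job`, and the `fail_at` parameter (used to simulate an interruption) are all hypothetical. The job saves only the state it needs to resume (a step counter and a running total), not the whole program image.

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint location, chosen only for this sketch.
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "cr_sketch.ckpt")


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write the state to a temporary file, then rename it into place.

    The rename means an interruption during the write cannot leave a
    half-written checkpoint under the final name.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(total_steps, checkpoint_every=100, path=CHECKPOINT_PATH, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that raises an error at a given step
    so that a restart can be demonstrated.
    """
    # Resume from the last checkpoint if there is one; otherwise start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated interruption")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

Calling `run_job` again after an interruption picks up from the most recent checkpoint, so at most `checkpoint_every` steps of work are repeated. That is the sense in which a job resumes from approximately where it left off.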

A diagram of C/R. The following icons from http://thenounproject.com are used in
the diagram and are distributed under the Creative Commons Attribution License 3.0:
Happy, by Julien Deveaux; Dead, by Julien Deveaux; Upload, by Philipp Süß; Save,
by Jevgeni Striganov

The need for this kind of fault tolerance often increases in high-performance computing (HPC), but HPC brings unique challenges to C/R implementers. It almost always involves concurrent programming, and there are considerations for both shared- and distributed-memory applications. Recently, robust C/R solutions have begun to emerge to address both kinds of applications. This is great news, because concurrent computing is used in the jobs most likely to benefit from C/R: massive, long-running jobs where lost work is very costly in both time and money.

In this module we'll explore several categories of C/R solutions and work through examples and exercises for some implementations that are particularly well suited to high-performance computing. As we will see, checkpointing has many benefits beyond fault tolerance. For this reason, it is worth learning about checkpointing and thinking about how it may benefit your own work.

Brandon Barker
Cornell Center for Advanced Computing
March 2016