
What is C/R?

C/R (checkpoint/restart) is a general methodology that enables software state to be saved to disk and resumed later. Think of the hibernate feature in Microsoft Windows, or of saving the state of a virtual machine in software like VirtualBox: if you've used either, then you have, in a way, used a type of C/R. Let's take a look at what's involved in C/R, both in terms of the components of the program state and in terms of saving and loading that state.

Program state

program memory / data structures
open file descriptors
open sockets
process ids (PIDs)
UNIX pipes
shared memory segments
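
To make the list above concrete, here is a minimal Python sketch of why "program state" is more than memory. It assumes a toy program whose state is one dictionary plus one open file; the helper names `capture_file_state` and `restore_file` are hypothetical and exist only for this illustration. In-memory data can be serialized directly, but the open file has to be saved as a description (path, mode, offset) and reopened on restart.

```python
import os
import pickle
import tempfile


def capture_file_state(f):
    """Record what is needed to reopen a file: path, mode, and offset."""
    return {"path": f.name, "mode": f.mode, "offset": f.tell()}


def restore_file(state):
    """Reopen the file and seek back to the saved offset."""
    f = open(state["path"], state["mode"])
    f.seek(state["offset"])
    return f


def demo():
    workdir = tempfile.mkdtemp()
    data_path = os.path.join(workdir, "input.txt")
    with open(data_path, "w") as out:
        out.write("line one\nline two\nline three\n")

    # "Running program": in-memory data plus an open file.
    f = open(data_path, "r")
    memory = {"lines_seen": [f.readline().strip()]}

    # The file object itself cannot be pickled ...
    try:
        pickle.dumps(f)
        picklable = True
    except TypeError:
        picklable = False

    # ... so the checkpoint stores a description of it instead.
    checkpoint = pickle.dumps({"memory": memory, "file": capture_file_state(f)})
    f.close()

    # "Restart": rebuild memory and reopen the file at the saved offset.
    saved = pickle.loads(checkpoint)
    g = restore_file(saved["file"])
    saved["memory"]["lines_seen"].append(g.readline().strip())
    g.close()
    return picklable, saved["memory"]["lines_seen"]
```

Calling `demo()` reads the second line of the file after the simulated restart, which only works because the offset was part of the checkpoint. Sockets, pipes, PIDs, and shared memory need analogous (and usually harder) treatment, which is why system-level C/R tools exist.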

Difficulties encountered in C/R

Some of the entities saved (listed above) require special permissions to restore (e.g., exactly restoring PIDs); not all C/R models can accommodate this, while others (namely virtual machines) get it for free.
Some of the saved entities may be impossible to restore in certain contexts, such as sockets that have closed and cannot be re-established.
For distributed processes, C/R must be coordinated across processes to guarantee safe handling of data in transit.
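
The last difficulty can be illustrated with a toy, single-process simulation. This is a sketch under strong assumptions, not a real distributed protocol: the `Worker` class, its `inbox` queue, and the `snapshot` and `restart` helpers are all hypothetical names invented for this example. Two workers have each sent the other a message that has not yet been processed. A checkpoint that saves only each worker's local total loses those messages, while one that also records the data in transit does not.

```python
from collections import deque


class Worker:
    """Stand-in for one process: a running total plus undelivered messages."""

    def __init__(self, name, total=0, inbox=()):
        self.name = name
        self.total = total
        self.inbox = deque(inbox)

    def receive_all(self):
        # Process every pending message by adding it to the total.
        while self.inbox:
            self.total += self.inbox.popleft()


def snapshot(workers, include_in_transit):
    """Save each worker's state, optionally with its pending messages."""
    return {
        w.name: {
            "total": w.total,
            "inbox": list(w.inbox) if include_in_transit else [],
        }
        for w in workers
    }


def restart(saved):
    """Rebuild the workers from a snapshot and let them finish their work."""
    workers = [Worker(n, s["total"], s["inbox"]) for n, s in saved.items()]
    for w in workers:
        w.receive_all()
    return sum(w.total for w in workers)


def demo():
    a, b = Worker("a"), Worker("b")
    b.inbox.append(5)  # a sent 5 to b; not yet processed
    a.inbox.append(7)  # b sent 7 to a; not yet processed
    naive = restart(snapshot([a, b], include_in_transit=False))
    coordinated = restart(snapshot([a, b], include_in_transit=True))
    return naive, coordinated
```

Here `demo()` returns a smaller combined total for the naive snapshot than for the coordinated one, because the two in-flight messages were dropped. In a real distributed job the processes must additionally agree on when to checkpoint, which this simulation sidesteps by running everything in one thread.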

C/R isn't just about running software and being able to resume it later; other use cases build on this core functionality. In the list below, the most prominent use cases come first and those still on the horizon come last. Not all of them are easy to implement, partly because systems keep changing and partly because the systems involved are complex. C/R is still a highly active area of research, and it is entirely likely that new use cases will arise in the coming years as solutions become more robust.

C/R Use Cases

Recover and provide fault tolerance (restart after an error)
Save a scientific, interactive session: R, MATLAB, IPython, etc.
Obviate the need for long initialization times
Migrate processes or even groups of virtual machines to other systems
Keep checkpoints that led to particular results or corner cases that document bugs (for the ultimate in reproducibility)
Replay a checkpoint to verify reasonable levels of convergence or error
Make an existing debugger jump backward in time
Interact with and analyze results of in-progress CPU-intensive processes through:
- temporary interrupts in an interpreted language
- attaching a checkpointed debugger in a compiled language
Provide interactive or callback-based "exception handling" in languages that don't normally support exceptions
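
The first use case in the list, restarting after an error, can be sketched at the application level in a few lines of Python. This is an illustration under stated assumptions rather than how any particular C/R package works: the state is a small dictionary, the checkpoint file format is `pickle`, and the function names and the `fail_at` parameter (which simulates a crash) are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, every=10, fail_at=None):
    """Sum the integers below n_steps, checkpointing every `every` steps."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a first call to `run` fails partway through, a second call with the same `path` resumes from the most recent checkpoint instead of from step 0, and only the work done since that checkpoint is repeated. The checkpoint interval `every` trades the overhead of saving against the amount of work lost per failure.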
