What is C/R?
C/R (checkpoint/restart) is a general methodology for saving the state of running software to disk and resuming it later. Think of the hibernate feature in Microsoft Windows, or of virtual machine software like VirtualBox: if you've used either, then you have, in a way, used a type of C/R. Let's take a look at what C/R involves, both the components that make up program state and the difficulties of saving and loading that state.
Program state
- program memory / data structures
- open file descriptors
- open sockets
- process ids (PIDs)
- UNIX pipes
- shared memory segments
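To make the first two items concrete, here is a minimal application-level sketch in Python. It is not how any particular C/R tool works; it only illustrates that in-memory data can be serialized directly, while an open file has to be described (here by its path and offset) and reopened on restart. The names `checkpoint`, `restart`, and `demo` are hypothetical helpers invented for this example.

```python
import os
import pickle
import tempfile


def checkpoint(state, fh, ckpt_path):
    """Save in-memory state plus enough metadata to reopen the file later."""
    snapshot = {
        "state": state,             # program memory / data structures
        "file_path": fh.name,       # an open file object cannot be pickled,
        "file_offset": fh.tell(),   # so record its path and current offset
    }
    with open(ckpt_path, "wb") as out:
        pickle.dump(snapshot, out)


def restart(ckpt_path):
    """Rebuild the state and reopen the file at the saved offset."""
    with open(ckpt_path, "rb") as src:
        snapshot = pickle.load(src)
    fh = open(snapshot["file_path"], "r")
    fh.seek(snapshot["file_offset"])
    return snapshot["state"], fh


def demo():
    workdir = tempfile.mkdtemp()
    data_path = os.path.join(workdir, "input.txt")
    ckpt_path = os.path.join(workdir, "state.ckpt")
    with open(data_path, "w") as f:
        f.write("1\n2\n3\n4\n")

    # Process the first half of the input, then checkpoint and "exit".
    fh = open(data_path, "r")
    state = {"total": 0}
    for _ in range(2):
        state["total"] += int(fh.readline())
    checkpoint(state, fh, ckpt_path)
    fh.close()

    # Later: restart from the checkpoint and finish the remaining lines.
    state, fh = restart(ckpt_path)
    for line in fh:
        state["total"] += int(line)
    fh.close()
    return state["total"]


if __name__ == "__main__":
    print(demo())
```

Sockets, PIDs, pipes, and shared memory segments have no such simple description, which is part of why transparent C/R needs help from the operating system or a virtual machine.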
Difficulties encountered in C/R
- Some of the entities saved (listed above) require special permissions to restore (e.g., exactly restoring PIDs); not all C/R models can accommodate this, while others (namely virtual machines) get it for free.
- Some of the saved entities may be impossible to restore in certain contexts, such as sockets that have closed and cannot be re-established.
- For distributed processes, C/R must be coordinated across processes to guarantee safe handling of data in transit.
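The last point can be illustrated with a toy simulation. The sketch below runs in a single Python process and uses a hypothetical `Channel` class as a stand-in for a network link; it is an assumption-laden illustration rather than a real distributed protocol. It compares a snapshot that records only each process's own state with one that also records the messages still in flight.

```python
from collections import deque


class Channel:
    """A toy in-memory stand-in for a link between two processes."""

    def __init__(self):
        self.in_flight = deque()

    def send(self, value):
        self.in_flight.append(value)

    def deliver_one(self):
        return self.in_flight.popleft()


def run(include_channel):
    channel = Channel()
    sender = {"sent_total": 0}
    receiver = {"received_total": 0}

    # The sender emits three messages; only one arrives before the checkpoint.
    for value in (10, 20, 30):
        channel.send(value)
        sender["sent_total"] += value
    receiver["received_total"] += channel.deliver_one()

    # Checkpoint: copy each process's state and, optionally, the channel's.
    snapshot = {
        "sender": dict(sender),
        "receiver": dict(receiver),
        "in_flight": list(channel.in_flight) if include_channel else [],
    }

    # Restart: the receiver resumes and drains whatever the snapshot recorded.
    restored_receiver = dict(snapshot["receiver"])
    for value in snapshot["in_flight"]:
        restored_receiver["received_total"] += value
    return snapshot["sender"]["sent_total"], restored_receiver["received_total"]


if __name__ == "__main__":
    print("uncoordinated:", run(include_channel=False))
    print("coordinated:  ", run(include_channel=True))
```

Without the channel contents, the restarted sender believes it has sent 60 while the receiver can only ever account for 10; the two in-flight messages are lost. Recording (or draining) data in transit as part of the checkpoint keeps the two sides consistent.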
C/R isn't just about running software and being able to resume it later; there are other use cases that build on this core functionality. We've listed the most prominent use cases first and those that are on the horizon last. Partly because the underlying systems are complex and keep changing, not all of the use cases presented below are easy to achieve. C/R is still a highly active area of research, and it is entirely likely that new use cases will arise in the coming years as solutions become more robust.
C/R Use Cases
- Recover and provide fault tolerance (restart after an error)
- Save a scientific, interactive session: R, MATLAB, IPython, etc.
- Obviate the need for long initialization times
- Migrate processes or even groups of virtual machines to other systems
- Keep checkpoints that led to particular results or corner cases that document bugs (for the ultimate in reproducibility)
- Replay a checkpoint to verify reasonable levels of convergence or error
- Make an existing debugger jump backward in time
- Interact with and analyze results of in-progress CPU-intensive processes through:
  - temporary interrupts in an interpreted language
  - attaching a checkpointed debugger in a compiled language
- Provide interactive or callback-based "exception handling" in languages that don't normally support exceptions
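The first use case, restarting after an error, can be sketched at the application level in a few lines of Python. This is a minimal illustration under stated assumptions, not a transparent C/R system: the loop, the checkpoint interval `every`, the `fail_at` parameter used to simulate a crash, and all function names are hypothetical and exist only for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write to a temporary file first, then rename it into place, so a
    crash mid-write never leaves a truncated checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as out:
        pickle.dump(state, out)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as src:
            return pickle.load(src)
    return {"step": 0, "total": 0}


def run(path, n_steps, every=10, fail_at=None):
    """Sum the integers below n_steps, checkpointing every `every` steps."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "loop.ckpt")
    try:
        run(path, n_steps=100, fail_at=57)
    except RuntimeError:
        pass  # work done since the last checkpoint is lost; earlier work is not
    resumed_from = load_checkpoint(path)["step"]
    total = run(path, n_steps=100)
    return resumed_from, total


if __name__ == "__main__":
    print(demo())
```

In this sketch the second call to `run` picks up at the last checkpoint rather than at step 0, so only the steps since that checkpoint are repeated. The same pattern also covers the long-initialization use case: checkpoint once after setup, then start every later run from that file.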