
Application-level C/R

Types of C/R illustrated in relation to the software component hierarchy, from the OS to individual processes. The application level is highlighted.

Application-level C/R solutions record the current state of the application's data in a file written to disk. It is up to the programmer to select a file format, include the necessary bookkeeping information, and write the code that lets the program load the file and resume computation. Formats like HDF5 and NetCDF are language-agnostic and targeted at HPC. Most application-level C/R solutions can automatically generate metadata and save it with the checkpoint, which can be valuable for both software and human consumption.
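As a rough illustration of the pattern described above, the following Python sketch saves and restores the state of a toy computation. It is not taken from any particular C/R library: the names `save_checkpoint`, `load_checkpoint`, `run`, and `FORMAT_VERSION` are hypothetical, and JSON is used only to keep the example dependency-free (a real HPC code would more likely use HDF5 or NetCDF).

```python
import json
import os
import tempfile

FORMAT_VERSION = 1  # hypothetical version tag for this sketch's checkpoint layout


def save_checkpoint(path, step, values):
    """Write the state to `path` atomically, with bookkeeping metadata."""
    payload = {
        "format_version": FORMAT_VERSION,  # bookkeeping: which layout this is
        "step": step,                      # bookkeeping: where to resume
        "values": values,                  # the application's data
    }
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so that a crash in the middle of a
    # write cannot corrupt the previous checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Return (step, values), or a fresh initial state if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, [0.0, 0.0, 0.0]
    with open(path) as f:
        payload = json.load(f)
    if payload["format_version"] != FORMAT_VERSION:
        raise ValueError("unsupported checkpoint format")
    return payload["step"], payload["values"]


def run(path, total_steps, checkpoint_every=10):
    """Toy computation: add 1.0 to every value at each step."""
    step, values = load_checkpoint(path)  # resume if a checkpoint exists
    while step < total_steps:
        values = [v + 1.0 for v in values]
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, values)
    return step, values
```

Calling `run("state.json", 25)` and later `run("state.json", 40)` resumes the second call from the last saved step (20 in this case) rather than from zero. The checkpoint interval is a design choice: shorter intervals lose less work after a failure but spend more time writing to disk.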

Application-level C/R works best when you have control over how your data is stored. In contrast, process-level C/R solutions, like DMTCP, are ideal when the core of your simulation's state is stored in a third-party program over which you have little control. If you create your own application-level C/R, you can modify the program and then resume computation from an existing checkpoint, which lets you fix certain kinds of bugs or extend features. The caveat is that the state format must either be preserved between program changes or be mapped to the data structure used for the new state. Editing a checkpoint and resuming from it is not currently supported in process-level C/R, unless you are comfortable using a hex editor on your checkpoint file! That isn't to say there won't be somewhat friendlier tools for doing so in the future.
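One way to handle the caveat about state formats is to store a version number in every checkpoint and map older layouts to the current one on load. The sketch below assumes a hypothetical version 1 state with a `values` field and a hypothetical version 2 that renames it to `temperature` and adds a `pressure` field; none of these names come from a real application.

```python
CURRENT_VERSION = 2  # hypothetical: the layout the modified program expects


def migrate_v1_to_v2(state):
    """Map a (hypothetical) v1 state onto the (hypothetical) v2 layout."""
    return {
        "format_version": 2,
        "step": state["step"],
        # v2 renames "values" to "temperature" ...
        "temperature": state["values"],
        # ... and adds a field that v1 lacked, filled with a default.
        "pressure": [0.0] * len(state["values"]),
    }


# One migration function per old version, keyed by the version it upgrades.
MIGRATIONS = {1: migrate_v1_to_v2}


def upgrade(state):
    """Apply migrations in order until the state matches CURRENT_VERSION."""
    state = dict(state)  # shallow copy so the caller's dict is not replaced
    while state["format_version"] < CURRENT_VERSION:
        state = MIGRATIONS[state["format_version"]](state)
    return state
```

With this in place, a checkpoint written by the old program, such as `{"format_version": 1, "step": 20, "values": [1.0, 2.0]}`, can be passed through `upgrade` before the modified program resumes from it. States already at the current version pass through unchanged.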

Another big advantage of application-level C/R is that it allows you to use a different number of processes after a restart; this is currently unsupported in process-level C/R.
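Changing the process count is possible when the checkpoint stores the data in a global layout that does not depend on how many processes wrote it. The following sketch uses plain Python lists as a stand-in for per-process data; the helper names `block_range`, `partition`, and `gather` are hypothetical, and a real code would do the equivalent with MPI or parallel I/O.

```python
def block_range(n_items, n_procs, rank):
    """Return the [start, stop) index range owned by `rank`.

    Items are split into contiguous blocks; the first `n_items % n_procs`
    ranks receive one extra item each.
    """
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def partition(global_values, n_procs):
    """Split a global array into one contiguous block per process."""
    n = len(global_values)
    parts = []
    for rank in range(n_procs):
        start, stop = block_range(n, n_procs, rank)
        parts.append(global_values[start:stop])
    return parts


def gather(parts):
    """Reassemble the global array from per-process blocks, in rank order."""
    return [v for part in parts for v in part]
```

For example, a run with 4 processes could write `gather(parts)` to its checkpoint, and a restart with 3 processes could call `partition(saved, 3)` on the loaded array so that each new rank picks up its own block.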

Advantages and disadvantages of using application-level checkpointing

Pros

Very low overhead
Few surprises if done properly

Cons

Needs thorough testing for each application
At least moderate additional development time
There is a chance something is missed
