Types of CR illustrated in relation to the software component hierarchy: from the OS to individual processes. The Application level is highlighted.

Application-level C/R solutions record the current state of the application's data in a file written to disk. It is up to the programmer to select a file format, include the necessary bookkeeping information, and write the code that lets the program load the file and resume computation. Formats like HDF5 and NetCDF are language-agnostic and designed for HPC workloads. Most application-level C/R solutions can automatically generate metadata and save it with the checkpoint, which can be valuable for both software and human consumption.
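The save-and-resume pattern described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: it uses JSON from the Python standard library instead of HDF5 or NetCDF to stay self-contained, and the function names (save_checkpoint, load_checkpoint, run), the version field, and the toy computation are all hypothetical.

```python
import json
import os
import tempfile
import time

CHECKPOINT_VERSION = 1  # hypothetical format version used by this sketch


def save_checkpoint(path, step, values):
    """Write the state plus bookkeeping metadata, replacing any old file atomically."""
    payload = {
        "metadata": {"version": CHECKPOINT_VERSION, "written_at": time.time()},
        "step": step,
        "values": values,
    }
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so that a crash in the middle of a
    # write cannot corrupt the previous checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, values), or a fresh initial state if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, [0.0] * 4
    with open(path) as f:
        payload = json.load(f)
    return payload["step"], payload["values"]


def run(path, total_steps, checkpoint_every=10):
    """Toy computation: add 1.0 to every value at each step."""
    step, values = load_checkpoint(path)  # resume if a checkpoint is present
    while step < total_steps:
        values = [v + 1.0 for v in values]
        step += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step, values)
    return step, values
```

Calling run("state.json", 25) and later run("state.json", 40) would continue from step 25 instead of starting over. A real application would also need to decide how often to checkpoint, weighing the cost of each write against the amount of work lost in a failure.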

Application-level C/R works best when you have control over how your data is stored. In contrast, process-level C/R solutions, like DMTCP, are ideal when the core of your simulation's state is stored in a third-party program over which you have little control. If you write your own application-level C/R, you can modify the program and then resume computation from an existing checkpoint, which lets you fix certain kinds of bugs or extend features. The caveat is that the state format must either be preserved across program changes or be mapped to the data structure used for the new state. Editing a checkpoint and resuming from it is not currently supported in process-level C/R, unless you are comfortable using a hex editor on your checkpoint file! That isn't to say friendlier tools for doing so won't appear in the future.
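One way to handle the caveat about changing state formats is to store a version number in every checkpoint and upgrade old checkpoints when they are loaded. The sketch below assumes a hypothetical version 1 layout with a "values" list and a hypothetical version 2 layout that renames it to "positions" and adds "velocities". None of these names come from a real library.

```python
CURRENT_VERSION = 2  # hypothetical: the layout the modified program expects


def _v1_to_v2(state):
    """Map the old layout onto the new one, filling new fields with defaults."""
    return {
        "version": 2,
        "step": state["step"],
        "positions": list(state["values"]),          # renamed field
        "velocities": [0.0] * len(state["values"]),  # new field, assumed to start at rest
    }


# One entry per old version, each upgrading by exactly one version.
MIGRATIONS = {1: _v1_to_v2}


def upgrade(state):
    """Apply migrations in sequence until the state matches CURRENT_VERSION."""
    while state["version"] < CURRENT_VERSION:
        migrate = MIGRATIONS.get(state["version"])
        if migrate is None:
            raise ValueError(f"no migration from version {state['version']}")
        state = migrate(state)
    if state["version"] != CURRENT_VERSION:
        raise ValueError(f"checkpoint version {state['version']} is newer than this program")
    return state
```

With this in place, the loading code calls upgrade() on whatever it reads from disk, so checkpoints written before a bug fix or feature addition can still be resumed by the new program.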

Another big advantage of application-level C/R is that it allows you to use a different number of processes after a restart; this is currently unsupported in process-level C/R.
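Restarting on a different number of processes is possible when the checkpoint stores the global data independently of how it was divided among processes. The sketch below is a hedged illustration: ranks are simulated with plain Python lists instead of MPI, and the helper names (block_range, scatter, gather) are hypothetical.

```python
def block_range(n_items, n_ranks, rank):
    """Return the [start, stop) slice of a global array owned by `rank`.

    Items are divided as evenly as possible; the first `n_items % n_ranks`
    ranks each receive one extra item.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def scatter(global_values, n_ranks):
    """Split the global array into one local chunk per simulated rank."""
    n = len(global_values)
    return [global_values[slice(*block_range(n, n_ranks, r))] for r in range(n_ranks)]


def gather(local_chunks):
    """Reassemble the global array from local chunks, in rank order."""
    return [v for chunk in local_chunks for v in chunk]
```

For example, a run on 4 ranks could write gather(chunks) to its checkpoint, and a restart on 3 ranks could call scatter(global_values, 3) on the loaded data. In a real MPI code, each rank would typically read only its own slice from a shared file instead of building the whole array in memory.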

Advantages and disadvantages of using application-level checkpointing
Pros
  • Very low overhead
  • Few surprises if done properly
Cons
  • Needs thorough testing for each application
  • At least moderate additional development time
  • There is a chance something is missed
 
© Cornell University | Center for Advanced Computing