Application-level C/R
Application-level C/R solutions record the current state of the application's data in a file written to disk. It is up to the programmer to select a file format, include the necessary bookkeeping information, and write the code that lets the program load the file and resume computation. Formats like HDF5 and NetCDF are language-agnostic and targeted at HPC. Most application-level C/R solutions can automatically generate metadata and save it with the checkpoint, which can be valuable for both software and human consumption.
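The pattern described above can be sketched in a few lines. This is a minimal illustration and not a recommended implementation: it uses JSON from the Python standard library instead of HDF5 or NetCDF so that it runs without extra packages, and the names `save_checkpoint`, `load_checkpoint`, `run`, and `CHECKPOINT_VERSION`, as well as the metadata fields, are hypothetical choices made for this example.

```python
import json
import os
import tempfile
import time

CHECKPOINT_VERSION = 1  # hypothetical format version, part of the bookkeeping


def save_checkpoint(path, step, state):
    """Write the state plus bookkeeping metadata to `path`."""
    payload = {
        "meta": {"version": CHECKPOINT_VERSION, "step": step, "written_at": time.time()},
        "state": state,
    }
    # Write to a temporary file first, then rename, so that a crash during
    # the write cannot leave a half-written checkpoint at `path`.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved payload, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        payload = json.load(f)
    if payload["meta"]["version"] != CHECKPOINT_VERSION:
        raise ValueError("unsupported checkpoint version")
    return payload


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Toy computation: sum the integers 0..n_steps-1, checkpointing periodically.

    `stop_after` simulates an interruption by returning early.
    """
    payload = load_checkpoint(path)
    if payload is None:
        step, total = 0, 0
    else:
        step, total = payload["meta"]["step"], payload["state"]["total"]
    while step < n_steps:
        total += step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, {"total": total})
        if stop_after is not None and step >= stop_after:
            return None  # interrupted before finishing
    return total
```

Calling `run("ckpt.json", 50, stop_after=25)` and then `run("ckpt.json", 50)` resumes from the last checkpoint written before the interruption instead of starting from step 0.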
Application-level C/R works best when you have control over how your data is stored. In contrast, process-level C/R solutions, like DMTCP, are ideal when the core of your simulation's state is stored in a third-party program over which you have little control. If you create your own application-level C/R, you can modify the program and then resume computation, which lets you fix certain kinds of bugs or extend features. The caveat is that the state format must either be preserved between program changes or be mapped to the data structure used for the new state. Editing a checkpoint and resuming it is not currently supported in process-level C/R, unless you are comfortable using a hex editor on your checkpoint file! That isn't to say there won't be somewhat friendlier tools for doing so in the future.
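One way to handle the caveat about changing state formats is to store a version number in the checkpoint and map old layouts to the new one at load time. The sketch below assumes a made-up layout change (separate `positions` and `velocities` lists in version 1, a single `particles` list in version 2); every field name and function name in it is hypothetical.

```python
CURRENT_VERSION = 2  # hypothetical version of the state layout the program now expects


def migrate_v1_to_v2(old):
    """Map the assumed v1 layout (parallel lists) to the assumed v2 layout (list of records)."""
    return {
        "version": 2,
        "step": old["step"],
        "particles": [
            {"x": x, "v": v} for x, v in zip(old["positions"], old["velocities"])
        ],
    }


# One migration function per old version; each produces the next version.
MIGRATIONS = {1: migrate_v1_to_v2}


def upgrade(checkpoint):
    """Apply migrations until the checkpoint matches CURRENT_VERSION."""
    while checkpoint["version"] != CURRENT_VERSION:
        migrate = MIGRATIONS.get(checkpoint["version"])
        if migrate is None:
            raise ValueError(f"no migration from version {checkpoint['version']}")
        checkpoint = migrate(checkpoint)
    return checkpoint
```

With this in place, a checkpoint written by the older program can be loaded, passed through `upgrade`, and resumed by the newer one.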
Another big advantage of application-level C/R is that it allows you to use a different number of processes after a restart; this is currently unsupported in process-level C/R.
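Restarting with a different process count is possible because the checkpoint can describe the global state and not each process's memory. The sketch below illustrates the idea with plain Python lists standing in for per-process data; a real HPC code would more likely use MPI with parallel I/O, and the helper names `partition`, `scatter`, and `gather` are assumptions made for this example.

```python
def partition(n_items, n_ranks):
    """Return (start, stop) index pairs for a block distribution over n_ranks."""
    base, extra = divmod(n_items, n_ranks)
    bounds = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional item each.
        stop = start + base + (1 if rank < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def scatter(global_state, n_ranks):
    """Split the global state into one chunk per rank."""
    return [global_state[a:b] for a, b in partition(len(global_state), n_ranks)]


def gather(chunks):
    """Reassemble the global state from per-rank chunks, in rank order."""
    return [value for chunk in chunks for value in chunk]


# Run on 4 "processes", checkpoint the global state, restart on 3.
state = list(range(10))
chunks_before = scatter(state, 4)
checkpoint = gather(chunks_before)      # what would be written to disk
chunks_after = scatter(checkpoint, 3)   # redistribution after the restart
```

Because the checkpoint holds the gathered global state, the number of chunks it is split into after the restart does not need to match the number used before.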
Advantages and disadvantages of using application-level checkpointing
Pros
- Very low overhead
- Few surprises if done properly
Cons
- Needs thorough testing for each application
- At least moderate additional development time
- There is a chance something is missed