Ad Hoc Solutions

When does it make sense to create your own C/R solution, that is, one tailored to your application or script? Ad hoc solutions may be acceptable or even desirable in cases that are extremely simple. For instance, embarrassingly parallel programs that can be restarted on any section of the data may qualify. One approach is to save results every N iterations for some appropriately sized N. Alternatively, you may want to save data after a particular stage of the analysis has concluded, as that is a natural synchronization point. Either way, the researcher still has to do some work to record which data points have already been computed and to determine which remaining data points must still be processed.
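The "save results every N iterations" idea can be sketched as follows. This is a minimal illustration, not a recommended implementation: the function names, the JSON file layout, and the `every_n` parameter are all assumptions made for the example, and `process_item` is a hypothetical stand-in for the real per-item computation.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # The rename replaces the old file in one step, so a crash during the
    # write above leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh empty state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"results": {}}


def process_item(item):
    """Hypothetical stand-in for the real per-item computation."""
    return item * item


def run(items, path, every_n=100):
    """Process independent items, saving a checkpoint every `every_n` new results."""
    state = load_checkpoint(path)
    results = state["results"]  # keys are strings because JSON requires it
    since_save = 0
    for item in items:
        key = str(item)
        if key in results:
            continue  # already computed in an earlier run
        results[key] = process_item(item)
        since_save += 1
        if since_save >= every_n:
            save_checkpoint(path, state)
            since_save = 0
    save_checkpoint(path, state)  # final save covers the last partial batch
    return results
```

Calling `run(range(1000), "results.json")` a second time after an interruption skips every item whose key is already in the file. At most `every_n - 1` results are lost if the program dies between saves, so the choice of N trades checkpoint overhead against repeated work.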

If the software is simple enough, you may get something satisfactory working after only a few tries, but it can be frustrating if the code meant to save your data causes your software to crash instead! This hints that creating your own checkpoint solution is not always ideal. In addition to the possibility of introducing new bugs, each section of a computational pipeline would need its own save and restore procedures unless you have taken the precaution of creating a global state structure that can be checkpointed at any point in the pipeline. Each of these procedures takes time to code, and it is always possible that you will later discover you missed saving some critical data necessary for restarting. If you are using synchronization points to save data, the run time between these checkpoints may be highly variable or very long.

In short, the difficulty of embedding the C/R mechanism in the code depends on multiple factors: the amount of re-engineering the code requires, the amount of parallelism in the code, the preference for writing checkpoints as binary or text data, whether parallel I/O is used, the identification of critical variables, and so on. We discuss manual implementation of C/R in the section on application-level C/R, which is essentially a more rigorous and standardized way of doing ad hoc C/R. It is worth weighing the effort of writing ad hoc C/R against the effort of testing an existing C/R solution, as some C/R solutions do not take long to get going.
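To make the "global state structure" idea concrete, here is one possible sketch in which a single object holds everything needed to resume, so one save routine and one restore routine serve every stage. All names here (`PipelineState`, `run_pipeline`, and the three toy stages) are hypothetical, and the sketch assumes the state is small and JSON-serializable, which will not hold for every application.

```python
import json
import os
from dataclasses import asdict, dataclass, field


@dataclass
class PipelineState:
    """Everything needed to resume the pipeline (illustrative fields only)."""
    completed_stages: list = field(default_factory=list)
    data: dict = field(default_factory=dict)


def save_state(state, path):
    """Write the whole state as text, then rename it over the old checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(asdict(state), handle)
    os.replace(tmp_path, path)


def load_state(path):
    """Restore the saved state, or start fresh if no checkpoint exists."""
    if not os.path.exists(path):
        return PipelineState()
    with open(path) as handle:
        return PipelineState(**json.load(handle))


def run_pipeline(stages, path):
    """Run each (name, function) stage once, checkpointing after each one."""
    state = load_state(path)
    for name, stage in stages:
        if name in state.completed_stages:
            continue  # finished in an earlier run
        stage(state.data)
        state.completed_stages.append(name)
        save_state(state, path)  # each stage end is a synchronization point
    return state


# Toy stages standing in for real analysis steps.
def load_stage(data):
    data["values"] = [1, 2, 3, 4]


def square_stage(data):
    data["squares"] = [v * v for v in data["values"]]


def sum_stage(data):
    data["total"] = sum(data["squares"])
```

A run would look like `run_pipeline([("load", load_stage), ("square", square_stage), ("sum", sum_stage)], "state.json")`. Because checkpoints are taken only at stage boundaries, a long stage still has to be repeated in full after a failure, which is the variable run time between checkpoints mentioned above.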

Let's summarize the advantages and disadvantages of using an ad hoc style of C/R:

Pros

Convenient when there are many independent items to be processed

Cons

Takes time to code
May miss some data needed for a restart
Have to write custom restore code
Different parts of the program may need different save and restore procedures, resulting in additional programming effort