Process-level C/R
You might think it would be sufficient to save the states of all the processes associated with a computational task, but these processes also hold OS-level resources (open files, sockets, memory mappings, and so on) that are assigned to them as they run. Therefore, processes cannot simply be saved as independent entities: some OS-level information must be saved along with them. However, Linux does not provide all the APIs necessary to recreate the processes' OS context exactly. This means the kernel itself may have to be modified to give the C/R software access to everything it needs from the OS.
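To make the idea of "OS context" a little more concrete, here is a minimal sketch in Python that lists some of the kernel-held state attached to the current process. It assumes a Linux system with `/proc` mounted and quietly returns empty results elsewhere. The function name `snapshot_os_context` is made up for this illustration, and real C/R tools record far more than this (and must also be able to restore it).

```python
import os


def snapshot_os_context():
    """Collect a few pieces of kernel-held state for the current process.

    Illustrative only: this reads state, it does not checkpoint anything.
    """
    context = {
        "pid": os.getpid(),
        "cwd": os.getcwd(),
        "open_files": {},     # file descriptor number -> what it refers to
        "memory_regions": 0,  # number of mappings listed in /proc/self/maps
    }

    fd_dir = "/proc/self/fd"
    if os.path.isdir(fd_dir):
        for name in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, name))
            except OSError:
                # The descriptor used to list the directory is closed
                # again before we get a chance to read its link.
                continue
            context["open_files"][int(name)] = target

    maps_path = "/proc/self/maps"
    if os.path.exists(maps_path):
        with open(maps_path) as maps:
            context["memory_regions"] = sum(1 for _ in maps)

    return context


if __name__ == "__main__":
    ctx = snapshot_os_context()
    print("pid:", ctx["pid"], "cwd:", ctx["cwd"])
    for fd, target in sorted(ctx["open_files"].items()):
        print(f"fd {fd} -> {target}")
    print("memory regions:", ctx["memory_regions"])
```

Even a trivial process shows several descriptors and many memory regions. Reading this state is the easy part; the difficulty described above lies in recreating it exactly at restart.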
In the section on containers in C/R, we noted that container-based C/R solutions must be capable of understanding and restoring various characteristics of a set of processes. The primary difference for process-level C/R is that the C/R software is not paired with an existing, full-blown container solution. Nonetheless, multiple processes can still be checkpointed without containers. As we'll see later, process-level C/R can even handle multiple processes on distributed systems, which existing container solutions cannot. Process-level C/R solutions also have the lowest overhead, since no container or VM layer is involved.
The oldest prevalent process-level C/R solution that is still relevant is Berkeley Lab Checkpoint/Restart (BLCR). BLCR is no longer updated and will not work with recent versions of Linux. Current popular options include CRIU (typically used with containers) and DMTCP.
Given the current state of the art, process-level C/R is the best option for HPC, if the particular system can support it. It avoids the perils and problems of kernel-level C/R and the overhead of containers and especially VMs. However, as we saw, process-level C/R solutions also sometimes modify the kernel. In the next section, we'll introduce a solution that does not require any form of kernel modification.
Advantages and disadvantages to using process-level checkpointing
Pros
- Usually simple to use
- Low overhead
Cons
- May have surprises; applications use different advanced features (e.g., IPC), and each solution supports a different feature set. Test first!
- BLCR requires modification of the application for static linking
- DMTCP static linking support is experimental
- CRIU is a bit new
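Since static linking is flagged as a weak spot for more than one tool in the list above, it can help to check how an application is linked before testing it under C/R. The sketch below is one rough way to do that in Python: it reads the ELF program headers and looks for a `PT_INTERP` entry, which names the dynamic loader. The function names are made up for this illustration, and the check is a heuristic under stated assumptions: a binary without `PT_INTERP` is treated as statically linked, and nothing here says whether any particular C/R tool will accept it.

```python
import struct

PT_INTERP = 3  # program header type that names the dynamic loader


def is_dynamically_linked(data: bytes) -> bool:
    """Return True if the ELF image in `data` has a PT_INTERP program header."""
    if len(data) < 16 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")

    is_64bit = data[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little-endian

    # Header fields that follow the 16-byte e_ident array. Only the
    # address-sized fields (e_entry, e_phoff, e_shoff) change width.
    fmt = endian + ("HHIQQQIHHHHHH" if is_64bit else "HHIIIIIHHHHHH")
    fields = struct.unpack_from(fmt, data, 16)
    e_phoff, e_phentsize, e_phnum = fields[4], fields[8], fields[9]

    # p_type is the first 4-byte field of every program header,
    # in both the 32-bit and the 64-bit layout.
    for index in range(e_phnum):
        offset = e_phoff + index * e_phentsize
        (p_type,) = struct.unpack_from(endian + "I", data, offset)
        if p_type == PT_INTERP:
            return True
    return False


def is_dynamically_linked_file(path: str) -> bool:
    """Convenience wrapper that reads the executable at `path`."""
    with open(path, "rb") as binary:
        return is_dynamically_linked(binary.read())
```

Calling `is_dynamically_linked_file` on the application's executable gives a quick first answer; a `False` result is a hint to read the chosen tool's notes on static linking before going further.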