Kernel-modifying C/R
We've now introduced enough C/R concepts and solutions that we can talk about how the OS kernel relates to certain components of C/R, and discuss how patching the kernel may influence your choice of a C/R solution.
The kernel can be thought of as an arbiter between processes running on a shared memory system. Any time there is contention for resources, the kernel is usually the part of the operating system that must manage those resources, as well as isolate processes from each other. As we saw in the page on containers, saving and restoring process state can be complicated by needing to interact with the OS. The C/R solution must know how to create a process within the OS, of course, but it must also know how to restore application memory and the entities associated with the program's file descriptors (most notably open files).
Restoring several processes (not an uncommon occurrence, as many applications require multiple processes) is more complicated still. Process IDs (PIDs) may need to be adjusted so that the restored processes can still communicate with each other, and PID assignment is ultimately determined by the OS kernel. Sockets and pipes are required for many types of communication. Although they are under the purview of the kernel, sockets and pipes have an API that allows them to be created from user space. PID restoration, on the other hand, does not have a user-space API, so PIDs would need to be virtualized in some way. The only alternative would be to use a kernel modification that exposes its own non-standardized and possibly controversial API for PID assignment (see note 1).
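To make the contrast concrete, here is a minimal sketch in Python. The `VirtualPidTable` class and the `recreate_channels` function are hypothetical names invented for this illustration; they are not taken from BLCR, CRIU, or any other C/R package. The sketch assumes the application only ever sees "virtual" PIDs, which the C/R layer translates to whatever real PIDs the kernel hands out after a restart, while pipes and sockets are simply recreated with ordinary user-space calls.

```python
import os
import socket


class VirtualPidTable:
    """Hypothetical helper: maps the PIDs an application remembers
    (virtual) to the PIDs the kernel actually assigned (real)."""

    def __init__(self):
        self._virt_to_real = {}
        self._real_to_virt = {}

    def bind(self, virtual_pid, real_pid):
        # Drop any stale binding for this virtual PID before rebinding.
        old_real = self._virt_to_real.pop(virtual_pid, None)
        if old_real is not None:
            del self._real_to_virt[old_real]
        self._virt_to_real[virtual_pid] = real_pid
        self._real_to_virt[real_pid] = virtual_pid

    def to_real(self, virtual_pid):
        return self._virt_to_real[virtual_pid]

    def to_virtual(self, real_pid):
        return self._real_to_virt[real_pid]


def recreate_channels():
    """Pipes and sockets, unlike PIDs, can be recreated from user space."""
    read_fd, write_fd = os.pipe()
    left, right = socket.socketpair()
    return read_fd, write_fd, left, right


# The PID values below are made up for illustration.
table = VirtualPidTable()
table.bind(1234, 1234)  # at checkpoint time, virtual PID == real PID
table.bind(1234, 5678)  # after restart, the kernel chose a different PID
print(table.to_real(1234))
```

A real implementation would also have to intercept every call that passes or returns a PID, which this sketch does not attempt.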
Many different types of C/R may modify the kernel in some way. Hypervisors used for VMs often require kernel modifications to improve performance or to pass data efficiently from virtualized devices to the real, underlying hardware devices. And as we just discussed, process-level C/R and containers also need to know about the OS in order to recreate certain features of processes. However, as we'll see, kernel modifications are not always required, even in these cases.
Two process-level C/R examples are BLCR (Berkeley Lab Checkpoint/Restart) and CRIU (Checkpoint/Restore In Userspace). BLCR and CRIU differ in fundamental ways besides their interaction with the kernel, but it is interesting to note that CRIU, as its name implies, works from user space and doesn't require a modified kernel. This no-kernel-modification approach allowed the CRIU developers to get their patches accepted into the mainline Linux kernel. On the other hand, kernel-dependent C/R, exemplified by BLCR, limits which OS versions and production systems are supported. BLCR was quite mature, but the project is no longer developed. Due to the kernel dependency inherent in its design, BLCR is unable to support unmodified kernels and is incompatible with the latest kernels (at the time of this writing, the latest kernel supported was 3.7.1 and development had been paused for almost ten years).
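As one illustration of checkpoint/restore support living in the mainline kernel, Linux kernels built with the `CONFIG_CHECKPOINT_RESTORE` option expose the file `/proc/sys/kernel/ns_last_pid`, which holds the last PID allocated in the current PID namespace. The sketch below only reads that file, which is harmless; writing to it in order to steer PID assignment needs elevated privileges and is not shown. The function name `read_last_pid` is an assumption made for this example, and the function returns `None` on systems where the file is absent.

```python
from pathlib import Path

# Present only on Linux kernels built with CONFIG_CHECKPOINT_RESTORE.
NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def read_last_pid(path=NS_LAST_PID):
    """Return the last PID allocated in this PID namespace,
    or None if the file is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


last_pid = read_last_pid()
if last_pid is None:
    print("ns_last_pid is not available on this system")
else:
    print(f"last PID allocated in this namespace: {last_pid}")
```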
It is worth mentioning that kernel modifications in Linux come in two varieties: kernel patches and loadable kernel modules (LKMs). Kernel patches modify the kernel itself, and require a recompiled kernel to function. LKMs, on the other hand, are object files that extend the functionality of a running kernel, typically by adding support for new system calls, hardware (LKMs are frequently device drivers), or file systems. While loading an LKM is certainly less work from a deployment perspective than rebuilding the kernel, for our purposes in large HPC systems an LKM is still invasive, as it modifies the running kernel and thus may affect the stability of the system. Appropriate caution and testing should therefore be observed before deploying such a solution to production systems. In addition, the compiled code for a loadable kernel module or a patch must be matched to the exact kernel that you are using. This is most easily supported as part of a full Linux distro, rather than by the individual program and programmer.
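The version-matching requirement can be sketched as a simple pre-deployment check. The code below is an illustration only: the function names are hypothetical, the 3.7.1 limit for BLCR is the figure quoted in this page, and comparing release strings is a simplification of the checks the kernel itself performs when a module is loaded.

```python
import platform
import re

# Last kernel version BLCR supported, as quoted in the text.
BLCR_LAST_SUPPORTED = (3, 7, 1)


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a release string
    such as '5.15.0-91-generic'. A missing patch level counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def blcr_might_support(release):
    """True if the kernel is no newer than the last one BLCR supported."""
    return parse_kernel_release(release) <= BLCR_LAST_SUPPORTED


def module_matches_kernel(built_for, running):
    """Simplified check: a module built for one release string is only
    assumed loadable on a kernel reporting the identical string."""
    return built_for == running


# Compare a module built for a made-up release against the running kernel.
running = platform.release()
print(running, module_matches_kernel("3.7.1", running))
```

Even a small distro update that changes only the suffix of the release string fails the exact-match check, which is why keeping modules in step with the kernel is easiest when the distro does it for you.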
Advantages and disadvantages of using kernel-modifying checkpointing.
Pros
- May give you features or optimizations not present in other solutions
Cons
- Requires modification of the kernel
- May not work for all kernels (e.g., BLCR does not work past 3.7.1)
1. Actually, since Linux kernel 3.3, thanks to the CRIU project, this can be enabled in Linux kernels, and so the claim is no longer strictly true. However, whether or not CRIU's approach is the best solution remains a point of debate despite its acceptance into the mainline kernel.