Types of C/R illustrated in relation to the software component hierarchy, from the OS to individual processes. VM selected.

While virtualization is really its own field, it is conceptually related to C/R and is a good place to start when discussing various kinds of C/R. The fundamental unit in virtualization is a virtual machine (VM), which is an operating system running inside a virtualization framework that provides an abstract view of system hardware. Sometimes the guest is allowed direct access to the host's hardware, and sometimes this access is mediated by the host OS.

Below, you can see an example of a free, desktop-oriented solution, VirtualBox, running a graphical session of Ubuntu Linux.¹ In this case, Windows 8 is the host operating system (the OS that VirtualBox runs in, and which itself runs directly on the hardware), but it could be any OS supported by VirtualBox, including Linux, macOS, Windows, and Solaris. By clicking on the "Machine" menu and selecting "Pause", a user can save the state of the entire guest operating system in memory. VirtualBox also supports saving the state to disk. The saved state can be restored and resumed at a later time, so it can serve as a C/R solution.

A screenshot of Ubuntu Linux running in VirtualBox on Windows 8.
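The same pause and save operations are also available from VirtualBox's command-line tool, `VBoxManage`. The following is a minimal sketch, assuming `VBoxManage` is on your path; the VM name `ubuntu-guest` and the helper functions are hypothetical and exist only for illustration. By default the helper prints each command rather than executing it.

```python
import subprocess


def vbox_checkpoint_commands(vm_name):
    """Build VBoxManage command lines for a simple VM checkpoint/restart cycle."""
    return {
        # Freeze the running guest; its state stays in host memory.
        "pause": ["VBoxManage", "controlvm", vm_name, "pause"],
        # Continue a paused guest.
        "resume": ["VBoxManage", "controlvm", vm_name, "resume"],
        # Write the guest's state to disk and stop the VM.
        "save_to_disk": ["VBoxManage", "controlvm", vm_name, "savestate"],
        # Starting a VM that has a saved state restores that state.
        "restore": ["VBoxManage", "startvm", vm_name, "--type", "headless"],
    }


def run(command, dry_run=True):
    """Print the command (dry run) or execute it with subprocess."""
    if dry_run:
        print(" ".join(command))
        return None
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # "ubuntu-guest" is a placeholder; substitute the name of your own VM.
    commands = vbox_checkpoint_commands("ubuntu-guest")
    for step in ("save_to_disk", "restore"):
        run(commands[step])
```

Passing `dry_run=False` would actually issue the commands, which only makes sense on a host where the named VM exists.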

Virtualization has many advantages for reproducibility in science, such as improving workflows, general convenience, and ease of use, but there are some downsides as well. While saving the entire machine state is easy and supports many types of applications, the strategy has disadvantages for HPC applications, which tend to run on distributed systems whose nodes may be used concurrently by multiple users. Multi-VM C/R is still a challenge and would incur a lot of overhead (time to save and restore large system state, storage of system state and VM images, etc.), particularly if one is working with a large number of VMs. The VM context requires a predefined partitioning of system memory and CPU resources, which makes the system relatively inflexible for scenarios where many applications from different users may be starting and stopping on the same nodes (such as in HPC).
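To get a feel for the overhead described above, here is a rough back-of-the-envelope sketch. It assumes that every VM writes its full RAM allocation to a shared file system, that the write bandwidth is split among all VMs, and that there is no compression or incremental checkpointing. The function name and all of the numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
def checkpoint_cost(num_vms, ram_gib_per_vm, write_gib_per_s, image_gib_per_vm=0.0):
    """Estimate storage (GiB) and save time (s) for checkpointing a set of VMs.

    Assumes full RAM snapshots written through one shared storage bandwidth,
    which is a simplification rather than a model of any specific hypervisor.
    """
    ram_total_gib = num_vms * ram_gib_per_vm
    storage_total_gib = ram_total_gib + num_vms * image_gib_per_vm
    save_seconds = ram_total_gib / write_gib_per_s
    return storage_total_gib, save_seconds


if __name__ == "__main__":
    # Hypothetical job: 64 VMs with 32 GiB of RAM and a 20 GiB disk image each,
    # writing to a file system that sustains 2 GiB/s in aggregate.
    storage, seconds = checkpoint_cost(64, 32.0, 2.0, image_gib_per_vm=20.0)
    print(f"storage needed: {storage:.0f} GiB")
    print(f"time to save RAM: {seconds:.0f} s")
```

Under these assumptions, the RAM snapshots alone come to 2048 GiB and take about 17 minutes to write, before counting the VM images or the time to restore.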

Practically any hypervisor implements VM C/R; if it can do virtualization, it can very likely save and restore the system state. You snapshot the hardware state and the memory, and that's basically it! Popular implementations include KVM, VirtualBox, and VMware, although there are many other hypervisors and emulators² available, and for certain niches the popular choices may not be the best option. KVM and VirtualBox are both open source, but KVM is a bit more lightweight, so it tends to be used on servers; its desktop interface and desktop support are currently not as polished as VirtualBox's. VMware offers a family of commercial solutions used on both desktop and server systems. In the server or cloud computing environment, an interesting application of C/R is to allow live migration of VMs from one host to another; the only observable effect of such a migration may be a momentary pause in service!
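On KVM hosts managed through libvirt, saving, restoring, and live migration are commonly driven with the `virsh` command-line tool. The sketch below only assembles and prints the relevant command lines; the domain name, state file path, and destination URI are hypothetical placeholders, and the assumption that your KVM guests are managed by libvirt may not hold on every system.

```python
def virsh_checkpoint_commands(domain, state_file, dest_uri):
    """Return virsh command lines for save, restore, and live migration."""
    return {
        # Write the guest's memory and device state to a file and stop the guest.
        "save": ["virsh", "save", domain, state_file],
        # Bring the guest back from the saved state file.
        "restore": ["virsh", "restore", state_file],
        # Move the running guest to another host while it keeps running.
        "live_migrate": ["virsh", "migrate", "--live", domain, dest_uri],
    }


if __name__ == "__main__":
    # All three values are placeholders for illustration only.
    commands = virsh_checkpoint_commands(
        domain="guest01",
        state_file="/var/tmp/guest01.state",
        dest_uri="qemu+ssh://node02/system",
    )
    for step, command in commands.items():
        print(f"{step}: {' '.join(command)}")
```

Save followed by restore is the checkpoint/restart pattern, while live migration transfers the same kind of state directly to another host instead of to a file.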

Advantages and disadvantages to using virtual machines for checkpointing
Pros
  • Very simple to use
  • Very few surprises
  • Most applications supported
Cons
  • Requires predefined partitioning of RAM and CPU resources
  • More overhead in most categories (storage of VM image, RAM snapshot, etc.)
  • Still a challenge for multi-VM C/R

1. If you have never used a virtual machine before, you are encouraged to try installing a Linux distribution inside of VirtualBox. Aside from giving you firsthand experience with virtualization, this will give you a convenient way to access a Linux environment on your non-Linux system (assuming you aren't using Linux as your host OS). Try to tweak various machine settings as well to give you an idea of what level of abstraction virtual machines deal with.

2. Emulators are like hypervisors, but are typically slower, as they translate the guest VM's instruction set to the host CPU's instruction set, allowing you to work with different architectures on a single system. Emulators are most commonly used for development, or for convenient access to applications designed for older systems.

 
© Cornell University | Center for Advanced Computing