Post-mortem debugging involves inspecting the state of an application that has crashed or otherwise failed, in order to determine the conditions that led to the failure. As with runtime debugging, both a symbolic debugger and ad-hoc print statements are useful for analyzing an application after it fails. There are, however, some additional considerations to keep in mind.

When debugging via print or logging statements, post-mortem analysis of a failed application follows the previously described ad-hoc workflow:

  • Enabling logging at the lowest ("debug") priority level, or adding appropriate print statements (see the sketch after this list)
  • Running the application until it fails
  • Reading through the logs to try to reconstruct what happened
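
As a minimal sketch of the print-statement approach in C, one common idiom is a logging macro that can be compiled away. The macro name DEBUG_LOG and the toy computation are illustrative assumptions, not taken from any particular application:

    #include <stdio.h>

    /* Hypothetical logging macro: active only when compiled with -DDEBUG,
       so the extra output disappears from production builds. */
    #ifdef DEBUG
    #define DEBUG_LOG(...) fprintf(stderr, "DEBUG: " __VA_ARGS__)
    #else
    #define DEBUG_LOG(...) ((void)0)
    #endif

    int main(void) {
        int data[4] = {3, 1, 4, 1};
        int sum = 0;
        DEBUG_LOG("starting accumulation over %d elements\n", 4);
        for (int i = 0; i < 4; i++) {
            DEBUG_LOG("i = %d, data[i] = %d, sum = %d\n", i, data[i], sum);
            sum += data[i];
        }
        printf("sum = %d\n", sum);
        return 0;
    }

Compiling with cc -DDEBUG enables the trace output; rerunning the failing case then produces a log that can be read through after the fact.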

However, this method may not provide enough detail. If an application terminates abnormally, most operating systems can be configured to dump the entire contents of program memory into a file known as a "core dump". A symbolic debugger like GDB can then be used to inspect the contents of the dump, revealing both the state of the call stack and the values of variables in memory. Compiling the application with -g so that the executable includes a symbol table is once again helpful in this regard.
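
To make this concrete, here is a small C program that crashes deliberately by writing through a null pointer; the file name crash.c is an arbitrary choice for the examples that follow:

    /* crash.c: terminates abnormally with SIGSEGV. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int *p = NULL;
        printf("about to dereference a null pointer...\n");
        *p = 42;    /* invalid write: the OS delivers SIGSEGV here */
        printf("never reached\n");
        return EXIT_SUCCESS;
    }

Compiled with cc -g -o crash crash.c, the resulting executable carries a symbol table, so a debugger reading the core dump can map addresses back to source lines and variable names.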

In HPC, data-intensive applications can consume extraordinary amounts of memory, so most HPC systems are configured not to dump the (potentially very large) core to a file by default. On Linux systems such as Frontera, core dumps therefore need to be enabled explicitly with ulimit -c:

ulimit -c sets the maximum size of a core file that the operating system is allowed to write; a limit of 0 disables core dumps entirely, while "unlimited" removes the cap.
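
For example, in a bash shell on a system configured not to dump core by default:

    $ ulimit -c             # query the current core file size limit
    0
    $ ulimit -c unlimited   # remove the limit for this shell session
    $ ulimit -c
    unlimited

The setting applies only to the current shell and the processes it launches, so in a batch job the ulimit command typically needs to appear in the job script before the application runs.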

As in the ad-hoc logging case, the typical workflow is:

  • Allowing core dumps via ulimit -c
  • Running the application until it fails
  • Pointing a debugger (gdb, say) at the core file, and inspecting the variables and call stack it records (see the example session after this list)
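
Putting the pieces together with the crash.c sketch from above (the name and location of the core file vary with the system's core-pattern configuration, so this session is illustrative):

    $ ulimit -c unlimited       # allow core dumps in this shell
    $ cc -g -o crash crash.c    # -g records the symbol table
    $ ./crash
    about to dereference a null pointer...
    Segmentation fault (core dumped)
    $ gdb ./crash core          # open the executable and its core file
    (gdb) bt                    # backtrace: the call stack at the crash
    (gdb) frame 0               # select the innermost stack frame
    (gdb) print p               # inspect the variable that caused it
    $1 = (int *) 0x0
    (gdb) quit

Because the dump captures the program's memory at the moment of failure, the debugger can answer the same questions as a live session, without having to reproduce the crash under the debugger.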
 