Introduction
Several tools can help you identify trouble spots in your code and tune it so that it runs well on advanced clusters. Here we take a look at a few of them, especially those available on Frontera and Stampede3, so you can judge whether they might give you insight into your code's performance roadblocks. We will highlight the following tools.
- Optimization Reports - given the right options, the Intel and GCC compilers generate info on:
        - Vectorization (whether loops were or were not vectorized, and why)
        - OpenMP parallelization
        - Inlining of function calls
        - Loop transformations that were made to improve vectorization or cache reuse
 
- Intel Advisor
        - Assists with improving vectorization, optimization, and shared-memory threading
        - Works with C, C++, C#, and Fortran source codes
        - Supports serial, threaded, and MPI applications
 
- Intel VTune Profiler
        - Profiles the performance of serial, threaded, MPI, and hybrid codes
        - Collects statistics on where time is being spent while code is running
        - Samples hardware counters at regular intervals and correlates them with lines of source code
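
To make the first item above concrete, here is a minimal sketch of requesting a vectorization report from GCC. The kernel `saxpy.c`, the temporary directory, and the report file name are hypothetical and were made up for this illustration. The `-fopt-info-vec-optimized-missed` option asks GCC to note which loops were vectorized and which were not. The Intel compilers offer an analogous `-qopt-report` option; consult the compiler documentation for the available report levels and phases.

```shell
# Sketch: request a GCC vectorization report for a small, hypothetical kernel.
workdir=$(mktemp -d)

# Write a tiny C source file containing one loop that should vectorize.
cat > "$workdir/saxpy.c" <<'EOF'
#include <stddef.h>

/* y = a*x + y; "restrict" promises the compiler that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
EOF

if ! command -v gcc >/dev/null 2>&1; then
    report_status="skipped (gcc not found)"
elif gcc -O3 -fopt-info-vec-optimized-missed \
         -c "$workdir/saxpy.c" -o "$workdir/saxpy.o" 2> "$workdir/report.txt"; then
    # GCC prints -fopt-info messages to stderr, which is captured above.
    report_status="written to $workdir/report.txt"
else
    report_status="failed (see $workdir/report.txt)"
fi
echo "GCC vectorization report: $report_status"
```

The report is plain text, so it can be read directly or searched for the source lines of the loops you care about.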
 
As noted, Advisor and VTune support the analysis and tuning of parallel codes that use MPI. Additional documentation and tutorials on accessing these features are available at the following pages:
- Intel Advisor User Guide: Analyze MPI Applications
- Intel VTune Profiler User Guide: MPI Code Analysis
- Intel VTune Profiler Performance Analysis Cookbook: Profiling MPI Applications
Intel has enhanced all of these tools in recent years so that they do more than produce reams of obscure reports or dumps of raw measurements; they actually try to interpret and diagnose the results for you. Furthermore, each one lets you "drill down" to get additional details about the top-level areas of concern. This is especially convenient in the GUI of Advisor or VTune. If you'd like to use a GUI on either Frontera or Stampede3, you might want to establish Remote Desktop Access through a VNC connection.
To see some of these tools in action, you could examine our companion material on Case Study: Profiling and Optimization on Advanced Cluster Architectures. While that material specifically examines optimization of a particular application on the KNL nodes on the former Stampede2, the concepts are broadly applicable to advanced cluster architectures, and the tools are suitable for use on other Intel processors such as SKX and CLX.
These are neither the only tools you can use nor the most accessible ones. A couple of widely available alternatives are at least worth mentioning:
- Perf: gathers performance counter statistics, e.g., perf stat -d ./my.exe gives counts of cache events (among others); see man perf-stat
- REMORA: lets you look for excesses or imbalances in memory and I/O usage; on Stampede3 and Frontera, see module spider remora
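
The perf command mentioned above can be wrapped in a short script along the following lines. This is only a sketch: `./my.exe` is the placeholder executable from the text, and the output file name `perf_stat.txt` is an assumption chosen for this example. The `-o` option of perf stat writes the counts to a file instead of the terminal.

```shell
# Sketch: collect detailed counter statistics for a placeholder executable.
target=./my.exe          # placeholder name; substitute your own program
outfile=perf_stat.txt    # hypothetical output file name

if ! command -v perf >/dev/null 2>&1; then
    perf_status="skipped (perf not found)"
elif [ ! -x "$target" ]; then
    perf_status="skipped ($target is not an executable file)"
elif perf stat -d -o "$outfile" "$target"; then
    # -d adds detailed counters, including cache events, to the default set.
    perf_status="counts written to $outfile"
else
    perf_status="failed (perf could not run; counter access may be restricted)"
fi
echo "perf stat: $perf_status"
```

The checks simply keep the script from aborting on a system where perf is not installed or access to hardware counters is restricted.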
For other common profiling techniques, see the Profiling and Debugging topic in the Virtual Workshop.
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)