Hotspot Identification
Key points:
- VTune can make use of specific compiler options to identify code hotspots (i.e., routines that take the largest fraction of run time). These hotspots can then be targeted with highest priority for performance optimization. On Stampede2, this command line will collect hotspot information and store it in a specified directory (e.g., "./hotspots_results"):
mpiexec -n 1 amplxe-cl -collect hotspots -r ./hotspots_results ./main.out
- In the command line above, ./main.out is the application code, amplxe-cl is VTune's command-line (cl) interface, which is suited to batch runs, and mpiexec launches everything via MPI with just 1 rank. (To analyze the collected results later, you typically open them in the VTune GUI with the amplxe-gui command.)
- VTune indicates that the Legendre transform function tra_qst2rp takes up the largest fraction of the computation time, and that the next two most costly functions call that function. gprof-based analysis carried out by the first speaker in Part 1 of the webcast confirmed that tra_qst2rp was the dominant computational routine. The speaker refers to tra_qst2rp as the "bottleneck function" for the remainder of the presentation.
- Intel Advisor is a different tool that runs compiled code through an emulator to collect more precise statistics on how the code is traversed during actual execution. Here, it is used to produce a loop survey:
mpiexec -n 1 advixe-cl -collect survey --project-dir vectorization_profile ./main.out
- The loop survey digs down in greater detail to provide information on various loop metrics:
- Contribution to Run Time
- Vectorization Efficiency
- Performance (GFLOPS)
- Arithmetic Intensity (FLOP/byte)
- Clues to Performance Issues
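Two of the metrics above, Performance (GFLOPS) and Arithmetic Intensity (FLOP/byte), are simple ratios. The sketch below works them out by hand for a hypothetical DAXPY-style loop, y[i] = a*x[i] + y[i], in double precision. The loop, the helper names, and the iteration count and run time are illustrative assumptions, not taken from the webcast, and the way Intel Advisor itself counts operations and bytes may differ.

```python
# Hand calculation of two loop metrics for a hypothetical DAXPY-style loop:
#     y[i] = a * x[i] + y[i]      (double precision)
# The loop and all numbers here are illustrative, not measured data.

DOUBLE_BYTES = 8  # size of one double-precision value


def arithmetic_intensity(flops_per_iter, bytes_per_iter):
    """Floating-point operations per byte of memory traffic (FLOP/byte)."""
    if bytes_per_iter <= 0:
        raise ValueError("bytes_per_iter must be positive")
    return flops_per_iter / bytes_per_iter


def gflops(total_flops, seconds):
    """Achieved performance in billions of floating-point operations per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_flops / seconds / 1e9


# Per iteration: one multiply and one add (2 FLOPs);
# load x[i], load y[i], store y[i] (3 doubles moved).
flops_per_iter = 2
bytes_per_iter = 3 * DOUBLE_BYTES

daxpy_ai = arithmetic_intensity(flops_per_iter, bytes_per_iter)
print(f"Arithmetic intensity: {daxpy_ai:.3f} FLOP/byte")

# Hypothetical run: 1e9 iterations completed in 1.0 second.
iterations = 1_000_000_000
print(f"Performance: {gflops(flops_per_iter * iterations, 1.0):.1f} GFLOPS")
```

A loop with low arithmetic intensity like this one moves many bytes for each floating-point operation, so it tends to be limited by memory bandwidth rather than by the arithmetic units.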