Key points:
  • VTune can make use of specific compiler options to identify code hotspots (i.e., routines that take the largest fraction of run time). These hotspots can then be targeted with highest priority for performance optimization. On Stampede2, this command line will collect hotspot information and store it in a specified directory (e.g., "./hotspots_results"):
    mpiexec -n 1 amplxe-cl -collect hotspots -r ./hotspots_results ./main.out
  • In the command above, ./main.out is the application executable; amplxe-cl is VTune's command-line (cl) interface, which lets the collection run in batch; and mpiexec launches everything via MPI with just 1 rank. (To analyze the collected results later, you would typically open them in the VTune GUI with the amplxe-gui command.)
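  • As a rough sketch of the workflow described above, the collected results can also be summarized as text with amplxe-cl itself, without opening the GUI. The script below is an illustration rather than the speaker's actual procedure: the run helper and the DRY_RUN switch are hypothetical conveniences introduced here, and the result directory and executable names are simply those from the command above. The webcast does not list the compiler options used; building with -g (debug symbols) alongside the usual optimization flags is the common way to let VTune map samples back to source lines.

```shell
#!/bin/sh
# Sketch: collect hotspots, then print a text report in the terminal.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0
# on a system where VTune is available to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
RESULT_DIR="./hotspots_results"   # result directory from the command above
APP="./main.out"                  # application executable from the command above

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Collect hotspot data with a single MPI rank.
run mpiexec -n 1 amplxe-cl -collect hotspots -r "$RESULT_DIR" "$APP"

# 2. Print the hotspots report for the collected result as text.
run amplxe-cl -report hotspots -r "$RESULT_DIR"
```

    With the default DRY_RUN=1 the script only echoes the two command lines, which makes it safe to try on a machine without VTune.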
  • VTune indicates that the Legendre transform function tra_qst2rp takes up the largest fraction of the computation time, and that the next two most costly functions both call it. The gprof-based analysis carried out by the first speaker in Part 1 of the webcast confirmed that tra_qst2rp was the dominant computational routine. For the remainder of the presentation, the speaker refers to tra_qst2rp as the "bottleneck function".
  • Intel Advisor is a different tool that runs compiled code through an emulator to collect more precise statistics on how the code is traversed during actual execution. Here, it is used to produce a loop survey:
    mpiexec -n 1 advixe-cl -collect survey --project-dir vectorization_profile ./main.out
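  • Once a survey has been collected, Advisor's command-line tool can also print it as a text report. The following is a minimal sketch, assuming the same project directory as in the command above; the REPORT_CMD variable and the fallback message are illustrative additions, not part of the webcast.

```shell
#!/bin/sh
# Sketch: print the previously collected loop survey as text.
PROJECT_DIR="vectorization_profile"   # project directory from the command above
REPORT_CMD="advixe-cl -report survey --project-dir $PROJECT_DIR"

if command -v advixe-cl >/dev/null 2>&1; then
    # Advisor is on the PATH: generate the survey report.
    $REPORT_CMD
else
    # Advisor is not available here: show what would have been run.
    echo "advixe-cl not found; would run: $REPORT_CMD"
fi
```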
  • The loop survey drills down to the level of individual loops and reports several metrics for each one:
    • Contribution to Run Time
    • Vectorization Efficiency
    • Performance (GFLOPS)
    • Arithmetic Intensity (FLOP/byte)
    • Clues to Performance Issues
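  • To make two of these metrics concrete, the sketch below works through the arithmetic for a hypothetical loop, y[i] = a*x[i] + y[i] on double-precision data. The loop, the iteration count, and the run time are assumptions chosen for illustration and do not come from the webcast; Advisor's own byte counts depend on how it measures memory traffic and may differ from this simple tally.

```shell
#!/bin/sh
# Sketch: arithmetic intensity (FLOP/byte) and performance (GFLOPS)
# for a hypothetical loop, computed with awk.
loop_metrics() {
    # $1 = FLOPs per iteration, $2 = bytes moved per iteration,
    # $3 = iteration count,     $4 = elapsed time in seconds
    awk -v f="$1" -v b="$2" -v n="$3" -v t="$4" 'BEGIN {
        printf "arithmetic intensity: %.3f FLOP/byte\n", f / b
        printf "performance: %.1f GFLOPS\n", f * n / t / 1e9
    }'
}

# y[i] = a*x[i] + y[i]:
#   2 FLOPs per iteration (one multiply, one add)
#   24 bytes per iteration (load x[i], load y[i], store y[i]; 8 bytes each)
#   1e8 iterations in an assumed 0.1 s
loop_metrics 2 24 100000000 0.1
```

    A loop with an intensity this low (about 0.083 FLOP/byte) does little arithmetic per byte moved, so its speed is usually limited by memory bandwidth rather than by the floating-point units.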
 
©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement