VTune can be configured to report in more detail precisely where in the code back-end stalls are taking place.

Key points:
  • The following command line instructs VTune to initiate a memory access analysis for the dynamo code:
    mpiexec -n 128 amplxe-cl -collect memory-access -r ./vtune_memory_analysis ./main.out
  • The code in question is suffering from a substantial number of L2 cache misses.
  • One strategy for relieving some of these memory bottlenecks is to increase prefetch aggressiveness, so that more data is brought into L2 before the code needs it. More aggressive prefetching can backfire when the predicted accesses do not match the data the code actually loads, because the unneeded cache lines consume memory bandwidth and cache space. To increase prefetch aggressiveness, add the following to the Intel compiler flags:
    COMPFLAGS = -qopt-prefetch=3
  • With more aggressive prefetching (level 3 rather than the default of 2), L2 miss counts drop by almost a factor of 50, and the cycles wasted waiting to service L2 hits and misses also drop considerably. The key loop in the bottleneck function speeds up from 3.8 GFLOPS to 5.83 GFLOPS, roughly a 1.5x improvement. Other loops have also sped up, as confirmed by an updated Loop Survey and Roofline Analysis.
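
The prefetch flag asks the compiler to insert prefetches on its own. The underlying idea can also be illustrated by hand. The following is a minimal sketch, not the dynamo code: it assumes a GCC-compatible compiler that provides `__builtin_prefetch`, and the function name `gather_sum` and the value of `PREFETCH_DISTANCE` are hypothetical choices made for illustration only.

```c
#include <stddef.h>

/* Hypothetical tuning value: how many iterations ahead to prefetch.
   A real code would need to tune this by measurement. */
#define PREFETCH_DISTANCE 16

/* Sum x[idx[i]] for i in [0, n). The indirect access through idx is
   the kind of pattern that can be hard to predict automatically. */
double gather_sum(const double *x, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Hint only: request the element needed PREFETCH_DISTANCE
               iterations from now. Arguments: address, 0 = read access,
               2 = moderate temporal locality. */
            __builtin_prefetch(&x[idx[i + PREFETCH_DISTANCE]], 0, 2);
        }
        sum += x[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it. Whether it helps depends on the access pattern and the chosen distance, which is why the effect of any prefetch setting should be confirmed with a fresh memory access analysis.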
 
© Cornell University | Center for Advanced Computing