Having demonstrated improved loop performance on individual KNL nodes through the optimizations discussed above, we can revisit the parallel scaling performance of the code on Stampede2.

Key points:
  • Peak performance has improved: the best time per iteration dropped by 18%, from 0.233 seconds/iteration to 0.191 seconds/iteration, which is a speedup of about 1.22x.
  • Alignment of static variables was implemented for only one loop in the code (line 325 of the primary bottleneck function). Adding alignment directives elsewhere in the code should yield further performance gains.
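
To make the alignment point concrete, here is a minimal sketch of what aligning static arrays and adding an alignment directive to a loop can look like in C. It is not taken from the code discussed here: the array names, the length N, and the loop body are hypothetical, and the sketch assumes C11 alignas together with an OpenMP SIMD pragma, which may differ from the directive actually used at line 325.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas (C11) */
#include <stddef.h>
#include <stdint.h>

#define N 1024          /* hypothetical array length */
#define ALIGN_BYTES 64  /* 64 bytes = one 512-bit AVX-512 vector on KNL */

static_assert((N * sizeof(double)) % ALIGN_BYTES == 0,
              "array size should be a multiple of the alignment");

/* Static arrays placed on 64-byte boundaries (names are hypothetical). */
static alignas(ALIGN_BYTES) double a[N];
static alignas(ALIGN_BYTES) double b[N];
static alignas(ALIGN_BYTES) double c[N];

/* Returns 1 if p sits on an ALIGN_BYTES boundary, 0 otherwise. */
static int is_aligned(const void *p)
{
    return ((uintptr_t)p % ALIGN_BYTES) == 0;
}

/* Fill the inputs with simple known values. */
static void init_arrays(void)
{
    for (size_t i = 0; i < N; ++i) {
        a[i] = (double)i;
        b[i] = 2.0;
        c[i] = 0.0;
    }
}

/* c[i] = a[i] * b[i] + c[i], with the alignment promised to the compiler.
   The pragma has no effect (and may trigger a warning) unless OpenMP SIMD
   support is enabled, e.g. -qopenmp-simd (Intel) or -fopenmp-simd (GCC). */
static void fused_multiply_add(void)
{
    #pragma omp simd aligned(a, b, c : ALIGN_BYTES)
    for (size_t i = 0; i < N; ++i) {
        c[i] = a[i] * b[i] + c[i];
    }
}
```

The declaration and the directive work as a pair: alignas guarantees where the data lives, and the aligned clause tells the compiler it may rely on that guarantee when vectorizing. Promising an alignment that the data does not actually have is an error, so the same pattern would need to be repeated, and checked, for each additional loop that is given an alignment directive.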
 
© Cornell University | Center for Advanced Computing