
Parallel Scaling

Having demonstrated improved loop performance on individual KNL nodes through the optimizations discussed above, we can revisit the parallel scaling performance of the code on Stampede2.


Key points:

Peak performance has improved by about 18%: the time per iteration dropped from 0.233 seconds to 0.191 seconds.
Alignment of static variables was implemented for only one loop in the code (line 325 of the primary bottleneck function). Adding alignment directives elsewhere in the code should yield further performance gains.
