Making effective use of the vector units on the KNL nodes is an important element in attaining peak performance on Stampede2.

Key points:
  • Performance analysis reveals that vectorization efficiency in the bottleneck loops is only 60-70%, indicating an opportunity for further speedup.
  • Key vectorization metrics are:
    • Gain = Scalar Loop Cost (cycles) / Vector Loop Cost (cycles)
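    • Vector efficiency = Gain / Vector Length, where Vector Length is the number of elements of the given data type that fit in one vector register (e.g., 8 double-precision values in a 512-bit KNL register)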
  • The vector efficiency can be hindered by overhead that breaks up the vector pipeline, such as conversion between scalar and vector operations.
  • Using array-level syntax to imply loops over array indices can assist compilers in optimizing vector pipelines, e.g., as in the target code's Fortran90 operation at the bottleneck loop (line 325):
    a3c_Y(:,m,n,2) = a3c_Y(:,m,n,2) + c1 * leg_llP_(:,nh) + c2 * leg_llsin1mP(:,nh)
  • To produce a vectorization report (which will be stored in a file with suffix .optrpt), add the following Intel compiler flag:
    COMPFLAGS = -qopt-report5
  • Note: specifying the compiler flag above as -qopt-report=5 is also valid, and different numerical values can be chosen to produce different amounts of reporting.
  • The vectorization report indicates that loop performance is hindered by arrays whose starting addresses are not aligned to 64-byte memory boundaries, and that aligning those arrays offers a significant potential speedup.
  • If you want to learn more about vectorization and how to enable it in your code, please refer to Vectorization in the Cornell Virtual Workshop.
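
The array-syntax and alignment points above can be illustrated with a minimal sketch. It is not the target code: the program name, the array names (y, p, q), and the sizes are hypothetical, and it assumes the Intel Fortran compiler, whose !dir$ attributes align directive requests 64-byte alignment (other compilers treat that line as an ordinary comment). The leading dimension is chosen as a multiple of 8 double-precision values (64 bytes), so every column starts on a 64-byte boundary once the array itself is aligned.

```fortran
! Minimal sketch; all names and sizes are hypothetical, not from the target code.
program align_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024     ! multiple of 8 doubles, i.e. of 64 bytes
  integer, parameter :: ncol = 4
  real(dp) :: y(n, ncol), p(n, ncol), q(n, ncol)
  ! Intel-specific directive: request 64-byte alignment for each array.
  !dir$ attributes align : 64 :: y, p, q
  real(dp) :: c1, c2
  integer :: j

  c1 = 0.5_dp
  c2 = 0.25_dp
  y = 0.0_dp
  p = 1.0_dp
  q = 2.0_dp

  do j = 1, ncol
     ! Array syntax: the loop over the first (contiguous) index is implied,
     ! which gives the compiler a clean unit-stride loop to vectorize.
     y(:, j) = y(:, j) + c1 * p(:, j) + c2 * q(:, j)
  end do

  ! Each element is 0.5*1 + 0.25*2 = 1, so the sum equals n*ncol = 4096.
  print *, 'sum(y) =', sum(y)
end program align_sketch
```

One way to build this on a KNL node, assuming the Intel compiler, is ifort -O3 -xMIC-AVX512 -qopt-report=5 align_sketch.f90, after which the .optrpt file should show whether the array accesses in the loop are reported as aligned. Intel Fortran also offers the -align array64byte option, which may be a simpler route than per-array directives when many arrays are involved.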
 
© Cornell University | Center for Advanced Computing