Vector Efficiency
Making effective use of the vector units on the KNL nodes is an important element in attaining peak performance on Stampede2.
Key points:
- Performance analysis reveals that vectorization efficiency in the bottleneck loops is only 60-70%, indicating an opportunity for further speedup.
- Key vectorization metrics are:
- Gain = Scalar Loop Cost (cycles) / Vector Loop Cost (cycles)
- Vector efficiency = Gain / Vector Length (for the data type)
- The vector efficiency can be hindered by overhead that breaks up the vector pipeline, such as conversion between scalar and vector operations.
- Using array-level syntax to imply loops over array indices can help the compiler optimize vector pipelines, as in this Fortran 90 statement from the target code's bottleneck loop (line 325):
a3c_Y(:,m,n,2) = a3c_Y(:,m,n,2) + c1 * leg_llP_(:,nh) + c2 * leg_llsin1mP(:,nh)
- To produce a vectorization report (which will be stored in a file with suffix .optrpt), add the following Intel compiler flag:
COMPFLAGS = -qopt-report5
- Note: the compiler flag above can also be written as -qopt-report=5. Other numerical values change the amount of detail reported, with 5 giving the most.
- Examination of the vectorization report indicates that loop performance is hindered by arrays that are not aligned to 64-byte memory boundaries, and that aligning those arrays could yield a significant speedup.
- If you want to learn more about vectorization and how to enable it in your code, please refer to Vectorization in the Cornell Virtual Workshop.
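
The gain and vector efficiency metrics defined above can be illustrated with a small worked calculation. The sketch below is only an illustration: the loop costs are hypothetical numbers, not values from the actual vectorization report, and it assumes double-precision data in the 512-bit vector registers of KNL (8 elements per vector).

```python
# Worked example of the vectorization metrics.
# All cycle counts below are hypothetical, chosen only for illustration.

VECTOR_BITS = 512  # width of an AVX-512 vector register on KNL


def vector_length(element_bytes):
    """Number of elements of the given size that fit in one vector register."""
    return VECTOR_BITS // (8 * element_bytes)


def gain(scalar_cost, vector_cost):
    """Gain = Scalar Loop Cost (cycles) / Vector Loop Cost (cycles)."""
    return scalar_cost / vector_cost


def vector_efficiency(scalar_cost, vector_cost, element_bytes):
    """Vector efficiency = Gain / Vector Length (for the data type)."""
    return gain(scalar_cost, vector_cost) / vector_length(element_bytes)


if __name__ == "__main__":
    scalar_cost = 52.0  # hypothetical scalar loop cost, in cycles
    vector_cost = 10.0  # hypothetical vector loop cost, in cycles
    element_bytes = 8   # assumption: double-precision (8-byte) reals

    g = gain(scalar_cost, vector_cost)
    eff = vector_efficiency(scalar_cost, vector_cost, element_bytes)
    print(f"vector length = {vector_length(element_bytes)}")
    print(f"gain          = {g:.2f}")
    print(f"efficiency    = {eff:.0%}")
```

With these assumed costs the efficiency falls inside the 60-70% range quoted above. An ideal loop would reach a gain equal to the vector length, that is, 8 for double precision or 16 for single precision.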