Vector Efficiency
Making effective use of the vector units on the KNL nodes is an important element in attaining peak performance on Stampede2.
Key points:
- Performance analysis reveals that vectorization efficiency in the bottleneck loops is only 60-70%, indicating an opportunity for further speedup.
- Key vectorization metrics are:
  - Gain = Scalar Loop Cost (cycles) / Vector Loop Cost (cycles)
  - Vector efficiency = Gain / Vector Length (for the data type)
- The vector efficiency can be hindered by overhead that breaks up the vector pipeline, such as conversion between scalar and vector operations.
- Using array-level syntax to imply loops over array indices can help compilers optimize vector pipelines, e.g., as in the target code's Fortran90 operation at the bottleneck loop (line 325):
    a3c_Y(:,m,n,2) = a3c_Y(:,m,n,2) + c1 * leg_llP_(:,nh) + c2 * leg_llsin1mP(:,nh)
- To produce a vectorization report (which will be stored in a file with the suffix .optrpt), add the following Intel compiler flag:
    COMPFLAGS = -qopt-report5
- Note: specifying the compiler flag above as -qopt-report=5 is also valid, and different numerical values can be chosen to produce different amounts of reporting.
- Examination of the vectorization report indicates that loop performance is hindered by arrays that are not aligned to 64-byte memory boundaries, and that aligning those arrays could yield a significant speedup.
- If you want to learn more about vectorization and how to enable it in your code, please refer to Vectorization in the Cornell Virtual Workshop.
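
The gain and vector efficiency metrics defined above can be worked through with a short calculation. The sketch below is only an illustration: the cycle counts are hypothetical values, not taken from the actual report, and it assumes double-precision data, for which the 512-bit vector registers on KNL hold 8 elements.

```fortran
program vector_metrics
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! Hypothetical cycle counts, chosen only for illustration. Real values
  ! would be read from the scalar and vector cost estimates in the report.
  real(dp), parameter :: scalar_cost = 52.0_dp
  real(dp), parameter :: vector_cost = 10.0_dp

  ! Assumption: double-precision data in 512-bit registers, so the
  ! vector length is 512 bits / 64 bits per element = 8 elements.
  integer, parameter :: vector_length = 512 / 64

  real(dp) :: gain, efficiency

  ! Gain = scalar loop cost / vector loop cost
  gain = scalar_cost / vector_cost

  ! Vector efficiency = gain / vector length for the data type
  efficiency = gain / real(vector_length, dp)

  print '(a,f6.2)',   'Gain              = ', gain
  print '(a,f6.1,a)', 'Vector efficiency = ', 100.0_dp * efficiency, ' %'
end program vector_metrics
```

With these made-up inputs the gain is 5.2 and the efficiency is 65%, which falls in the 60-70% range quoted above. A perfectly vectorized double-precision loop would reach a gain of 8.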
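
To make the point about array-level syntax concrete, the following minimal sketch compares an array-syntax update with the explicit loop it implies. The arrays y_array, y_loop, p, and q and the values of c1 and c2 are hypothetical stand-ins with the same shape of update as the bottleneck line; they are not the target code's actual variables.

```fortran
program array_syntax_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 16
  real(dp) :: y_array(n), y_loop(n), p(n), q(n)
  real(dp) :: c1, c2
  integer :: i

  ! Hypothetical input data, for illustration only
  c1 = 2.0_dp
  c2 = 0.5_dp
  do i = 1, n
     p(i) = real(i, dp)
     q(i) = real(n - i, dp)
  end do
  y_array = 1.0_dp
  y_loop  = 1.0_dp

  ! Array-level syntax: the loop over the index is implied by the colon
  y_array(:) = y_array(:) + c1 * p(:) + c2 * q(:)

  ! The explicit loop that the statement above stands for
  do i = 1, n
     y_loop(i) = y_loop(i) + c1 * p(i) + c2 * q(i)
  end do

  ! Both forms should produce identical results
  if (any(y_array /= y_loop)) error stop 'array and loop forms differ'
  print *, 'array-syntax and explicit-loop results match'
end program array_syntax_demo
```

The array form states that every element is updated independently, with no loop-carried dependence for the compiler to rule out before it vectorizes the statement.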
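
One possible way to address the alignment issue flagged by the report is to use the Intel Fortran alignment directives. The sketch below shows the general idea on two hypothetical arrays, a and b; it is not necessarily how the target code was modified. The !DIR$ lines are specific to the Intel compiler, and other compilers treat them as ordinary comments.

```fortran
program aligned_arrays
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024
  real(dp) :: a(n), b(n)
  ! Intel-specific: request that a and b start on 64-byte boundaries
  !DIR$ ATTRIBUTES ALIGN : 64 :: a
  !DIR$ ATTRIBUTES ALIGN : 64 :: b
  integer :: i

  ! Hypothetical data, for illustration only
  a = 1.0_dp
  do i = 1, n
     b(i) = real(i, dp)
  end do

  ! Intel-specific: tell the vectorizer it may use aligned loads and stores
  ! in the loop that follows. This is safe only if the arrays really are
  ! aligned, which the ATTRIBUTES ALIGN directives above request.
  !DIR$ VECTOR ALIGNED
  do i = 1, n
     a(i) = a(i) + 2.0_dp * b(i)
  end do

  ! Simple self-check: a(i) should now equal 1 + 2*i
  if (a(1) /= 3.0_dp .or. a(n) /= 1.0_dp + 2.0_dp * real(n, dp)) then
     error stop 'unexpected result'
  end if
  print *, 'a(1) =', a(1), '  a(n) =', a(n)
end program aligned_arrays
```

The Intel compiler also provides the option -align array64byte, which requests 64-byte alignment for arrays without per-variable directives. Whichever approach is used, regenerating the vectorization report is a reasonable way to confirm that the unaligned-access remarks are gone.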
© Cornell University | Center for Advanced Computing | Copyright Statement | Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)