Vectorization & Parallelization

Making effective use of the vector units on the KNL nodes is an important element in attaining peak performance on Stampede2.

Performance analysis reveals that vectorization efficiency in the bottleneck loops is only 60-70%, indicating an opportunity for further speedup.
Key vectorization metrics are:
- Gain = Scalar loop cost (cycles) / Vector loop cost (cycles)
- Vector efficiency = Gain / Vector length (number of elements per vector for the data type)
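As a worked illustration of the two metrics above, the following minimal sketch computes them in Fortran. The module and function names (vec_metrics, vector_gain, vector_efficiency) are hypothetical, and the cycle counts in the usage note are invented for illustration; they are not taken from the actual vectorization report.

```fortran
module vec_metrics
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Gain = scalar loop cost / vector loop cost (both in cycles).
  pure function vector_gain(scalar_cost, vector_cost) result(gain)
    real(dp), intent(in) :: scalar_cost, vector_cost
    real(dp) :: gain
    gain = scalar_cost / vector_cost
  end function vector_gain

  ! Efficiency = gain / number of elements that fit in one vector.
  pure function vector_efficiency(gain, vector_length) result(eff)
    real(dp), intent(in) :: gain
    integer, intent(in) :: vector_length
    real(dp) :: eff
    eff = gain / real(vector_length, dp)
  end function vector_efficiency
end module vec_metrics
```

For example, a 512-bit vector register holds 8 double-precision values, so the vector length for that data type is 8. Under the assumed costs of 52 cycles (scalar) and 10 cycles (vector), vector_gain(52.0_dp, 10.0_dp) gives a gain of 5.2, and vector_efficiency(5.2_dp, 8) gives 0.65, i.e., 65%.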
The vector efficiency can be hindered by overhead that breaks up the vector pipeline, such as conversion between scalar and vector operations.
Array syntax, which implies loops over array indices, can help compilers optimize vector pipelines. An example is the target code's Fortran 90 statement at the bottleneck loop (line 325):
a3c_Y(:,m,n,2) = a3c_Y(:,m,n,2) + c1 * leg_llP_(:,nh) + c2 * leg_llsin1mP(:,nh)
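To make the implied loop explicit, the sketch below writes the same kind of update in two ways: once with array syntax and once as an explicit DO loop over the first index. It uses simplified one-dimensional arrays with hypothetical names (y, p, q) rather than the target code's actual arrays, so it only illustrates the equivalence of the two forms and is not the application's implementation.

```fortran
module axpy_forms
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Array-syntax form: the loop over the index is implied by (:).
  subroutine update_array_syntax(y, p, q, c1, c2)
    real(dp), intent(inout) :: y(:)
    real(dp), intent(in)    :: p(:), q(:), c1, c2
    y(:) = y(:) + c1 * p(:) + c2 * q(:)
  end subroutine update_array_syntax

  ! Explicit-loop form: the same update, one element per iteration.
  subroutine update_explicit_loop(y, p, q, c1, c2)
    real(dp), intent(inout) :: y(:)
    real(dp), intent(in)    :: p(:), q(:), c1, c2
    integer :: i
    do i = 1, size(y)
      y(i) = y(i) + c1 * p(i) + c2 * q(i)
    end do
  end subroutine update_explicit_loop
end module axpy_forms
```

Both routines produce the same values. The array-syntax form states directly that every element is updated independently, which is the property a compiler needs to establish before it can vectorize the loop.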
To produce a vectorization report (which will be stored in a file with suffix .optrpt), add the following Intel compiler flag:
COMPFLAGS = -qopt-report5
Note: specifying the compiler flag above as -qopt-report=5 is also valid, and different numerical values can be chosen to produce different amounts of reporting.
The vectorization report indicates that loop performance is hindered by arrays that are not aligned on 64-byte memory boundaries, and that aligning those arrays could yield a significant speedup.
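One possible way to request such alignment with the Intel Fortran compiler is the ATTRIBUTES ALIGN directive, sketched below. The module, array, and routine names (aligned_update, y, p, q, setup, update) are hypothetical stand-ins, not the target code's variables, and whether this particular change delivers the speedup suggested by the report would need to be confirmed by re-profiling. The Intel compiler option -align array64byte is an alternative that applies 64-byte alignment to arrays without source changes.

```fortran
module aligned_update
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), allocatable :: y(:), p(:), q(:)
  ! Intel-specific directive: request 64-byte alignment for these arrays.
  ! Compilers that do not recognize it treat the line as a comment.
  !DIR$ ATTRIBUTES ALIGN: 64 :: y, p, q
contains
  ! Allocate the arrays with n elements and fill them with sample values.
  subroutine setup(n)
    integer, intent(in) :: n
    if (allocated(y)) deallocate(y, p, q)
    allocate(y(n), p(n), q(n))
    y = 0.0_dp
    p = 1.0_dp
    q = 2.0_dp
  end subroutine setup

  ! Array-syntax update over the whole (aligned) arrays.
  subroutine update(c1, c2)
    real(dp), intent(in) :: c1, c2
    y(:) = y(:) + c1 * p(:) + c2 * q(:)
  end subroutine update
end module aligned_update
```

A 64-byte boundary matches the width of a 512-bit vector register, so an aligned array lets each vector load or store start at the beginning of a register-sized block. After rebuilding with the vectorization report enabled, the report can be checked again to see whether the loop's accesses are now described as aligned.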
If you want to learn more about vectorization and how to enable it in your code, please refer to Vectorization in the Cornell Virtual Workshop.