Aligning Variables

Aligning the variables accessed in loops, through a combination of compiler flags and compiler directives added to the code, can improve vectorization performance in key parts of the code.

Key points:

The compiler can be instructed to allocate variables so that the base pointer is aligned on a specific byte boundary. This allows vectorized loops to load data from memory into vector registers without gaps at the head of the register, improving vector efficiency.
COMPFLAGS = -align array64byte
However, this Intel compiler flag only addresses the alignment of dynamically allocated data. For static data in common blocks or fields within derived datatypes, a compiler flag is insufficient, and compiler directives indicating memory alignment must be inserted into the code where the variables are declared, e.g.,
double complex, private :: a3c_Y(i_Th, 0:2*_pHc-1,i_pN,3)
!DIR$ ATTRIBUTES ALIGN: 64 :: a3c_Y
With these alignment optimizations, an updated vectorization report indicates aligned memory access, with an estimated speedup that has increased from 5.83 to 7.49 as a result of alignment. In addition, loop analytics and a roofline plot for the optimized code indicate improved performance: several loops in the code have now moved above the L2 cache bandwidth roofline, although the primary bottleneck loop still lies below that roofline.
Now that some of the memory bottlenecks have been relaxed, further performance enhancements might be forthcoming through additional thread parallelization via OpenMP.
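
The directive-based alignment described in the key points above can be illustrated with a small standalone program. This is only a sketch, not code from the case study: the module name align_demo, the array a_static, the length n, and the helper misalignment are all hypothetical. It declares one static module array with the ALIGN directive and one dynamically allocated array, then reports how far each base address lies from a 64-byte boundary.

```fortran
! Minimal sketch; align_demo, n, a_static, and misalignment are hypothetical names.
module align_demo
   use, intrinsic :: iso_c_binding, only: c_ptr, c_intptr_t
   implicit none
   private
   public :: n, a_static, misalignment

   integer, parameter :: n = 1024   ! hypothetical array length

   ! Static module array. The Intel directive on the next line requests a
   ! 64-byte-aligned base address; other compilers read it as a comment.
   double complex, target :: a_static(n)
   !DIR$ ATTRIBUTES ALIGN: 64 :: a_static

contains

   ! Offset in bytes of address p from the preceding 64-byte boundary
   ! (0 means p is 64-byte aligned).
   function misalignment(p) result(off)
      type(c_ptr), intent(in) :: p
      integer(c_intptr_t) :: off
      off = modulo(transfer(p, 0_c_intptr_t), 64_c_intptr_t)
   end function misalignment

end module align_demo

program check_alignment
   use, intrinsic :: iso_c_binding, only: c_loc
   use align_demo, only: n, a_static, misalignment
   implicit none
   double complex, allocatable, target :: a_dyn(:)

   allocate(a_dyn(n))
   a_static = (0.0d0, 0.0d0)
   a_dyn = (0.0d0, 0.0d0)

   ! Report the byte offset of each array's first element.
   print '(a,i0)', 'static array offset (bytes):  ', misalignment(c_loc(a_static(1)))
   print '(a,i0)', 'dynamic array offset (bytes): ', misalignment(c_loc(a_dyn(1)))
end program check_alignment
```

When built with an Intel Fortran compiler and the -align array64byte flag shown above, both offsets would be expected to be 0. A compiler that does not recognize the directive treats it as an ordinary comment, so the static array's offset may then be nonzero. A check like this is only a sanity test; the vectorization report remains the more direct evidence that the compiler is generating aligned memory accesses.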