Aligning the arrays accessed in vectorized loops, through a combination of compiler flags and compiler directives inserted into the source, can improve vectorization performance in key parts of the code.

Key points:
  • The compiler can be instructed to allocate variables so that each base pointer is aligned on a specific byte boundary. Vectorized loops can then load memory into vector registers without gaps at the head of the register, improving vector efficiency.
    COMPFLAGS = -align array64byte
  • However, this Intel compiler flag addresses only the alignment of dynamically allocated data. For static data in common blocks, or for fields within derived types, a compiler flag is insufficient; compiler directives specifying the memory alignment must be inserted into the code where the variables are declared, e.g.,
    double complex, private :: a3c_Y(i_Th, 0:2*_pHc-1,i_pN,3)
    !DIR$ ATTRIBUTES ALIGN: 64:: a3c_Y
  • With these alignment optimizations in place, the updated vectorization report indicates aligned memory access, and the estimated speedup has increased from 5.83 to 7.49. Loop analytics and a roofline plot for the optimized code also indicate improved performance: several loops have now moved above the L2 cache bandwidth roofline, although the primary bottleneck loop still lies below it.
  • Now that some of the memory bottlenecks have been relaxed, further performance gains may be achievable through additional thread parallelization with OpenMP.
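
One way to confirm that the flag and the directive described above have taken effect is to print each array's base address modulo 64. The following is a minimal sketch, not code from the application discussed here: the program name, the array names `a_static` and `a_dyn`, and the length `n` are hypothetical. It assumes an Intel Fortran build using the `-align array64byte` flag shown above; other compilers treat the `!DIR$` line as an ordinary comment, so a nonzero offset is possible there.

```fortran
! Minimal sketch (hypothetical names): report how far each array's base
! address lies from the previous 64-byte boundary. An offset of 0 means
! the array is 64-byte aligned.
program align_check
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t, c_double
  implicit none

  integer, parameter :: n = 1024                 ! hypothetical length

  ! Static data: alignment requested with an Intel compiler directive,
  ! placed next to the declaration as in the example above.
  real(c_double), target, save :: a_static(n)
  !DIR$ ATTRIBUTES ALIGN: 64 :: a_static

  ! Dynamically allocated data: alignment is left to the compiler flag
  ! (assumed here to be -align array64byte).
  real(c_double), allocatable, target :: a_dyn(:)

  integer(c_intptr_t) :: addr

  allocate(a_dyn(n))
  a_static = 1.0_c_double
  a_dyn    = 2.0_c_double

  ! Reinterpret each C address as an integer and take it modulo 64.
  addr = transfer(c_loc(a_static), addr)
  print '(a,i0)', 'a_static offset from 64-byte boundary: ', &
        modulo(addr, 64_c_intptr_t)

  addr = transfer(c_loc(a_dyn), addr)
  print '(a,i0)', 'a_dyn    offset from 64-byte boundary: ', &
        modulo(addr, 64_c_intptr_t)

  deallocate(a_dyn)
end program align_check
```

Building this once with and once without the flag should show whether the offset of `a_dyn` changes, while the vectorization report remains the authoritative check that the compiler actually generated aligned accesses in the loops of interest.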
 
© Cornell University | Center for Advanced Computing