Automatic Vectorization

Compilers employ a number of strategies to produce vector instructions from C/C++ or Fortran code. Although the topic is too complex to describe in its entirety, it is useful to conceptualize compiler vectorization in terms of loop unrolling. This captures the essence of what compilers are trying to do when presented with a vectorizable loop—even if it is not the actual process by which they do it.

for (i=0; i<N; i++) {
    a[i] = b[i] + c[i];
}

        ↓

for (i=0; i<N; i+=4) {
    a[i+0] = b[i+0] + c[i+0];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
}

        ↓

for (i=0; i<N; i+=4) {
    Load b(i..i+3)
    Load c(i..i+3)
    Operate b+c → a
    Store a
}

Conceptualizing vectorization via loop unrolling

Imagine a typical HPC code containing a very simple for loop that merely adds two large arrays of data of size N, as in the example code shown here. To make things as simple as possible, suppose that N is evenly divisible by 4, and suppose further that the CPU's vector unit can accommodate 4 values of the type present in the arrays (e.g., double- or single-precision floats). Given this vector length, we imagine the compiler unrolling the loop into 4 statements, corresponding to 4 consecutive array elements, while adjusting the loop counter to increment by 4 on each iteration. The compiler can then merge the four additions into a single SIMD operation on vectors of length four. Thus, in each iteration of the compiled loop, one vector operation computes four iterations of the original loop.
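To make the Load/Operate/Store steps concrete, the sketch below shows roughly what the vectorized loop could look like if written by hand with AVX intrinsics, assuming double-precision arrays and N divisible by 4. The intrinsic names are real, but the hand-written routine itself is only illustrative; in automatic vectorization the compiler emits the equivalent instructions on its own.

#include <immintrin.h>

/* Illustrative hand-vectorized version of a[i] = b[i] + c[i],
   assuming AVX support, double-precision data, and N divisible by 4 */
void add_vectors(double *a, const double *b, const double *c, int N)
{
    for (int i = 0; i < N; i += 4) {
        __m256d vb = _mm256_loadu_pd(&b[i]);   /* Load b(i..i+3)  */
        __m256d vc = _mm256_loadu_pd(&c[i]);   /* Load c(i..i+3)  */
        __m256d va = _mm256_add_pd(vb, vc);    /* Operate b+c → a */
        _mm256_storeu_pd(&a[i], va);           /* Store a(i..i+3) */
    }
}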

As the compiler proceeds through the source code, it attempts to determine if each loop it encounters can be "unrolled by 4" so it can be executed on SIMD hardware with a vector length of 4 elements (in this example). The compiler must prove to itself in each case that computing the loop in SIMD "chunks" produces exactly the same results as computing the loop sequentially without vectorization. As we will see in a later section on data dependencies, sometimes this is impossible, due to dependencies within or between the iterations.
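As a simple illustration of the kind of loop that defeats this analysis, consider a hypothetical recurrence in which each iteration reads the value written by the previous one. Computing four elements in one SIMD chunk would use stale data, so the compiler must leave such a loop scalar.

/* Hypothetical loop-carried dependence: a[i] needs the a[i-1]
   computed on the previous iteration, so the four additions in a
   SIMD "chunk" cannot all be performed at once */
for (i = 1; i < N; i++) {
    a[i] = a[i-1] + b[i];
}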

Where vectorization is possible, the benefit becomes even greater when additions are paired with multiplications in the computed expression. In that case, a special instruction known as a fused multiply-add (FMA) effectively performs both a multiply and an add in one cycle, yielding an even better vector speedup.
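For instance, a loop body of the following form (a hypothetical triad-style update) contains a multiply feeding directly into an add, which a vectorizing compiler can typically map onto vector FMA instructions when the target hardware supports them (e.g., with an appropriate -mfma or -march option on GCC or Clang).

/* Hypothetical example: the multiply and add for each element can be
   fused into a single FMA, applied to a whole vector of elements at once */
for (i = 0; i < N; i++) {
    a[i] = b[i] * c[i] + d[i];
}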
