Although a compiler may use vector instructions virtually anywhere (indeed, some implementations of printf() use vector instructions!), the vectorization relevant to HPC performance occurs in loops: most commonly for loops in C/C++, and do loops in Fortran. For a compiler to express a computation within a loop in terms of vector instructions, the loop should meet certain criteria. Loops that do not meet these criteria may not be vectorizable at all.
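As a point of reference, the sketch below shows the kind of loop that compilers vectorize readily: a countable for loop containing simple arithmetic and sequential array accesses. (The function and variable names are illustrative only, not taken from any particular code.)

    /* A classic vectorization candidate (SAXPY): countable trip count,
       no branches, no function calls, unit-stride array accesses. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];   /* one multiply-add per element */
        }
    }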

The basic criteria that apply to vectorizable loops are listed below. In the past, these properties were strict requirements, but compiler technology has matured to the point that some of them are no longer absolute requirements (as noted). Still, it is helpful to know the characteristics that naturally lead to successful loop vectorization.

The loop must be countable at runtime.
The number of loop iterations does not need to be known statically at compile time, but it must be determinable immediately before the loop begins executing. In practice, this means that the loop cannot have conditional exits such as break statements. For this reason, for/do loops are more likely to meet this requirement in a straightforward manner than while loops.
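To illustrate, the hypothetical sketch below contrasts a countable loop with one whose trip count cannot be established in advance:

    /* Countable: the trip count n is fixed before the loop starts. */
    void scale_all(int n, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = 2.0f * x[i];
    }

    /* Not countable: the data-dependent break means the number of
       iterations is unknown until the loop is already executing. */
    void scale_until_negative(int n, const float *x, float *y)
    {
        for (int i = 0; i < n; i++) {
            if (x[i] < 0.0f)
                break;
            y[i] = 2.0f * x[i];
        }
    }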
There should be a single control flow within the loop.
Vector instructions typically encompass moving data and performing computations. Simple for loops that contain straightforward computations and nothing else are relatively easily mapped to vector instructions. But branching and conditionals often cannot be represented as vector instructions, so if or switch statements within a loop may prevent it from being vectorized.
In certain circumstances, however, compilers can work around this limitation by representing simple conditionals as masked assignments. In this scenario, the compiled code follows both branches of the conditional (the then and the else clauses) using vector instructions. The branching condition is also computed in a vectorized manner to yield a vector of Boolean values: this is the mask. The mask is applied during the final assignment to determine, element by element, which branch's result to keep. Even with the masking trick, loops containing if or switch statements are at increased risk of failing to vectorize.
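As an illustration, a simple conditional like the one in this sketch is a typical candidate for masked assignment (whether it actually vectorizes depends on the compiler and its options):

    /* Both assignments can be computed across a whole vector of elements;
       the comparison x[i] > 0.0f yields the mask that selects, element by
       element, which result is stored into y. */
    void clip_negatives(int n, const float *x, float *y)
    {
        for (int i = 0; i < n; i++) {
            if (x[i] > 0.0f)
                y[i] = x[i];
            else
                y[i] = 0.0f;
        }
    }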
The loop should not contain function calls.
A function call is implemented as a series of relatively complicated steps: creating a new stack frame, pushing arguments onto the stack, jumping to a location in memory, and executing whatever instructions are present at that location. None of these are vector operations, so the presence of a function call will generally disqualify a loop from vectorization.
However, if a high enough level of optimization is in effect, the compiler may be able to inline a function, meaning that the source lines of the function are effectively copied into the body of the loop. This permits the compiler to attempt to vectorize the function line by line, along with the rest of the loop.
There is another exception: function calls that the compiler is able to replace with inline vector instructions found in a library. For Intel compilers, such functions include intrinsic math functions like sin(), sqrt(), pow(), and others from the Intel Short Vector Math Library (SVML). GCC offers a similar capability if the underlying version of glibc in the OS is 2.22 or higher, so that the vectorized math library libmvec is present. (This is not the case on Stampede2, as one can verify by executing ldd --version.)
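The sketch below combines both exceptions: a small user-defined function that the compiler may inline, and a math intrinsic that may be replaced with a call into a vector math library. (Whether either replacement actually occurs depends on the compiler, its optimization level, and the libraries available.)

    #include <math.h>

    /* A small user function; at sufficient optimization levels the
       compiler may inline its body into the loop and then vectorize. */
    static inline float scaled(float v) { return 2.0f * v; }

    void roots(int n, const float *x, float *y)
    {
        for (int i = 0; i < n; i++) {
            /* sqrtf() may be mapped to a vectorized math routine
               (e.g., from SVML or libmvec, where available). */
            y[i] = sqrtf(scaled(x[i]));
        }
    }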
Arrays in the loop should not use indirect indexing.
The presence of indirect memory accesses such as a[b[i]] makes it difficult or impossible for the compiler to determine whether assignments into array a are truly independent, such that they can safely be done in parallel. Furthermore, as we will see later, vectorization works best when the data are aligned and sequential in memory, not accessed randomly.
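The following sketch shows why indirect indexing is problematic: if the index array b contains duplicate values, two iterations write to the same element of a, so the iterations are not independent.

    /* The compiler generally cannot prove that b[i] != b[j] for i != j,
       so it must assume the updates to a may conflict. */
    void scatter_add(int n, float *a, const int *b, const float *c)
    {
        for (int i = 0; i < n; i++)
            a[b[i]] += c[i];   /* potentially conflicting writes */
    }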

In the case of nested loops, the compiler will normally try to vectorize only the innermost loop. Note, however, that the compiler may be able to rearrange (interchange) the order of the loops as an optimization step.
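For example, in the dense matrix-vector product sketched below, the inner loop over j is the natural target for vectorization, since it performs a simple reduction with unit-stride accesses into A:

    /* The compiler would normally vectorize the innermost loop (over j);
       it might also restructure the loop nest if that improves the
       pattern of memory accesses. */
    void matvec(int n, const float *A, const float *x, float *y)
    {
        for (int i = 0; i < n; i++) {
            float sum = 0.0f;
            for (int j = 0; j < n; j++)
                sum += A[i*n + j] * x[j];  /* unit stride through row i */
            y[i] = sum;
        }
    }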

 