How Vectorization Works
Vectorization is the process by which mathematical operations found in loops in scientific code are executed in parallel on special vector hardware found in CPUs and coprocessors. A "vector" is a contiguous set of data of a uniform type, usually floating-point numbers. Each number in the vector is called an element. In vectorized code, basic operations such as addition and multiplication are performed on pairs of small, fixed-size vectors of numerical values, and corresponding elements in the two vectors are processed simultaneously by the vector processing unit. The net effect of vectorization is a speedup in floating-point or integer computations that is ideally equal to the vector length, that is, the number of elements processed simultaneously. For example, a 512-bit vector register holds 8 double-precision values, so the ideal speedup for double-precision arithmetic on such hardware is a factor of 8.
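As a minimal sketch of the kind of loop this describes, consider an element-wise sum of two arrays. The function name `vec_add` and its signature are illustrative assumptions, not part of any library. Because every iteration is independent of the others, a compiler is free to group several consecutive iterations into one vector instruction.

```c
#include <stddef.h>

/* Illustrative (hypothetical) helper: c[i] = a[i] + b[i] for i in [0, n).
   Each iteration reads and writes a different element, so the iterations
   are independent and several of them can be processed by a single
   vector instruction, e.g. 8 floats at a time in a 256-bit register. */
void vec_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Whether the loop is actually vectorized depends on the compiler, the optimization level, and the target hardware; the source code itself is ordinary scalar C.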
Many compilers vectorize code automatically as part of their code optimization process. This process, however, is not perfect. Certain code constructs can make it difficult or impossible for the compiler to determine whether floating-point-intensive loops can be vectorized. In addition, inefficient use of cache and memory can reduce the performance gains obtained from vectorization. As vector lengths have grown in modern CPUs, particularly in Intel's Xeon architectures beginning with the Xeon Phi (MIC), there is more performance to gain from vectorizing code and a greater penalty for failing to vectorize. This can make it worthwhile to put some effort into removing obstacles that inhibit vectorization, particularly in the code sections that consume the most time.
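Two common obstacles of this kind can be sketched in C. The function names `scale` and `running_sum` are hypothetical and chosen only for illustration. The first shows possible pointer aliasing, which the C99 `restrict` qualifier can rule out; the second shows a loop-carried dependence, which prevents straightforward vectorization no matter how the pointers are declared.

```c
#include <stddef.h>

/* Hypothetical example 1: aliasing.
   Without `restrict`, the compiler must allow for the possibility that
   `out` and `in` overlap in memory, which can block vectorization or force
   a runtime overlap check. `restrict` is the programmer's promise that
   they do not overlap. */
void scale(float *restrict out, const float *restrict in, float s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = s * in[i];
    }
}

/* Hypothetical example 2: loop-carried dependence.
   Iteration i needs the value written by iteration i-1, so consecutive
   iterations cannot simply be executed side by side in one vector
   instruction. */
void running_sum(float *x, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        x[i] += x[i - 1];
    }
}
```

In the first case a small change to the declaration removes the obstacle. In the second case the algorithm itself would have to be restructured.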
This tutorial discusses vectorization from three perspectives:
- Hardware Perspective
  - Run vector instructions involving special registers and functional units that allow in-core parallelism for vector operations on arrays of data.
- Compiler Perspective
  - Determine how and when it is possible to express computations in terms of vector instructions. This occurs at certain optimization levels. Compilers are imperfect at vectorizing code.
- User Perspective
  - Write code in a manner that allows the compiler to deduce that vectorization is possible. This allows the compiler to generate vector instructions that can take advantage of hardware parallelism.
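The interplay of the three perspectives can be sketched with a dot product. The function name `dot`, the file name `dot.c`, and the choice of GCC are assumptions made for illustration. A floating-point sum is a case where the compiler may decline to vectorize on its own, because vectorizing changes the order of the additions. The OpenMP `simd` directive with a `reduction` clause is one way for the user to state that reordering is acceptable.

```c
#include <stddef.h>

/* Hypothetical example: dot product of two double arrays.
   User perspective: a simple counted loop with unit-stride access, plus an
   explicit statement that `sum` is a reduction variable.
   Compiler perspective: with the directive honored, the compiler may keep
   several partial sums in one vector register and combine them at the end.
   Hardware perspective: each vector instruction then multiplies and adds
   several elements at once.

   One possible way to build and inspect this with GCC (assumed toolchain):
       gcc -O3 -fopenmp-simd -fopt-info-vec -c dot.c
   If the OpenMP option is omitted, the pragma is ignored and the code
   still compiles and runs as a scalar loop. */
double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because a vectorized reduction adds the terms in a different order, its result can differ from the scalar result in the last few bits for general floating-point input.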
If you're looking for a "quick start" on how vectorization works in a modern processor, and how to make sure your compiler is helping your code to take advantage of it, then the first two perspectives may be sufficient for you. This Introduction, plus the two topics titled Vector Hardware and Vectorizing with Compilers, should be enough to give you the basics.