Vector operations are a form of SIMD parallelism. The amount of parallelism in a vector operation is theoretically proportional to the number of bits in a hardware vector register. But because real-world code always contains some serial (non-vector) instructions, the overall performance increase due to vectorization is always less than the theoretical speedup of the vector operations themselves.

Amdahl's Law sets an upper limit on the speedup that is possible. Given a code in which only a percentage of the workload is vectorizable, it predicts that the speedup curve flattens out as the vector width is increased. Let's say we have three such codes, vectorizable to different degrees, and we compile and run them on vector hardware of increasing width (where the clock rate of the CPU is the same in each case). The results would look like this:

[Figure: Performance of vectorized code as predicted by Amdahl's Law. Three curves show the maximum obtainable speedup as a function of vector width for three different percentages of vectorized workload.]

As the figure shows, if a code is only partially vectorized, its maximum speedup might fall well below N for a vector width of N elements. (Note that the scales of both axes are logarithmic.) The exact upper bound depends strongly on the percentage of code that is vectorized. Poorly vectorized code hardly benefits from wide vectors: at a vector width of 16, code that is 60% vectorized performs only about twice as fast as non-vectorized code. Even with 90% of the workload vectorized, a speedup of around 6 is the best that can be expected. As the graph makes clear, codes must have a very high percentage of vectorization in order to benefit from extra-wide vector units.
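The data points quoted above can be checked with a short calculation. Below is a minimal sketch in C, assuming the standard Amdahl formula for the maximum speedup, S(N, P) = 1/((1 - P) + P/N), which is the speedup form of the efficiency formula given later in this section; the set of vector widths shown is just illustrative.

    #include <stdio.h>

    /* Maximum speedup predicted by Amdahl's Law for a workload whose
       vectorizable fraction is p, on a vector unit that is n elements wide. */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        const double fractions[] = { 0.60, 0.90 };   /* percentages quoted above */
        const int    widths[]    = { 2, 4, 8, 16 };  /* illustrative vector widths */

        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 4; j++)
                printf("P = %2.0f%%  N = %2d  ->  max speedup = %.2f\n",
                       100.0 * fractions[i], widths[j],
                       amdahl_speedup(fractions[i], widths[j]));
        return 0;
    }

At N = 16, this prints a maximum speedup of about 2.3 for the 60% case and about 6.4 for the 90% case, consistent with the figure described above.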

Today's reality is that CPUs offer higher per-core performance mainly by increasing their SIMD parallelism through wider vector registers and additional vector units. For code to benefit from newer hardware with more SIMD capabilities, applications must be vectorized to the greatest extent possible.

Assessing overall vectorization

How do you know if your application is vectorized as much as it can be? The overall vector efficiency of an application is defined as the ratio of the observed vector speedup to the ideal vector speedup. Let S be the observed speedup relative to non-vectorized code, and let N be the vector width, i.e., the number of elements of the code's common datatype that fit in a vector register. The vector efficiency is then e = S/N. For example, if an application involves double precision math for which the vector length is 8, and if the vectorized application performs 4.7x faster than when vectorization is disabled, we calculate the overall vector efficiency to be 59%.
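Written out for this example, with S = 4.7 and N = 8:

\[ e = {S \over N} = {4.7 \over 8} \approx 0.59 = 59\% \]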

When we say overall vector efficiency, we mean that the measurement is to be applied to the whole code. An easy way to do this in practice is to use the Linux wrapper utility /bin/time, which you simply put ahead of the command that you use to run your application. This is done twice, once with and once without vectorization. (An alternative, on an architecture that supports vector instructions of different widths, is to compare vectorized performance at one vector width vs. another; the ideal speedup would then be the ratio of the two widths.)

The -no-vec compiler flag and the #pragma novector directive in source code can be helpful here: the former compiles the entire code without vectorization, while the latter disables vectorization for a specific loop. The -no-vec option is particularly useful because it leaves all other optimizations in place, allowing speed comparisons between similarly optimized code in which the only difference is the absence of vectorization.
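As a small illustration (the function, array names, and loops here are hypothetical), the directive is placed immediately before the loop it should affect; with the Intel compilers, only that loop is left unvectorized while the rest of the file is optimized as usual.

    #define LEN 1024

    /* Hypothetical example: two similar loops, one vectorized, one not. */
    void axpy_pair(float a, const float *x, float *y, float *z)
    {
        /* This loop is eligible for auto-vectorization. */
        for (int i = 0; i < LEN; i++)
            y[i] = a * x[i] + y[i];

        /* Vectorization is suppressed for this loop only, e.g., so its
           timing can be compared against the vectorized loop above. */
        #pragma novector
        for (int i = 0; i < LEN; i++)
            z[i] = a * x[i] + z[i];
    }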

What result should you expect to see? As stated earlier, it is never realistic to expect perfect vector efficiency: Amdahl's Law implies there will be significant performance degradation due to the unavoidable presence of serial code sections. We should therefore aim to compare the measured efficiency e with the maximum vector efficiency predicted by Amdahl's Law, which is

\[ \max\{e\} = \max\{S/N\} = {1 \over (1-P)N + P} \]

Use of this formula requires knowledge of P, the vectorizable fraction of the workload. If P=1, the efficiency is 100%, independent of N; if P=0, the efficiency drops like 1/N, because only one of the vector "slots" is ever used.
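For example, for the 90%-vectorized code at a vector width of 16 discussed in connection with the figure:

\[ \max\{e\} = {1 \over (1-0.9)\cdot 16 + 0.9} = {1 \over 2.5} = 40\% \]

which corresponds to the maximum speedup of around 6 quoted for that case (0.40 × 16 = 6.4).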

If P is unknown, the Amdahl formula can be inverted to find P under the assumption that the measured efficiency is the maximum possible. In other words, we assume that every vectorized section achieves the ideal speedup of a full factor of N. The result is

\[ P = {{N - {1\over\max\{e\}} \over {N - 1}}} = {{1 - {1\over S}} \over {1 - {1\over N}}} \]
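As a quick check, plugging the earlier example (S = 4.7 at N = 8) into this formula gives

\[ P = {1 - {1\over 4.7} \over 1 - {1\over 8}} \approx {0.787 \over 0.875} \approx 0.90 \]

so under this assumption, roughly 90% of that workload is vectorized.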

If the vectorizable fraction computed by this formula seems low, it may be worth investing time to improve the application performance. One approach would be to look for areas of the code that did not vectorize and use the vector-aware coding strategies described earlier to try to fix these areas.

But it is also possible (even likely) that the already-vectorized code sections are not getting their full vector speedup, contrary to our assumption. In the rest of this section, we look at reasons why the actual speedup may be less than your expectations based on Amdahl's Law, and what you can do about it.
