Vectorization represents the final dimension in our "3D scaling" scheme. Now that we've scaled out and up, how deep can we scale? The answer depends on the number of vector slots that are available for floating-point calculations. In this exercise we examine vectorization using just a single core of one CLX node (instead of continuing to build on our previous hybrid example). You might also like to consult our companion tutorial on vectorization.

Download vector_ma.c either directly via the link, or by copying the text and pasting it into a file as before. The icc compilation command to use on Frontera is (use icx on Stampede3):

$ icc -xCORE-AVX512 -qopt-zmm-usage=high -O3 -qopenmp vector_ma.c -o vector_ma

The first three flags, -xCORE-AVX512 -qopt-zmm-usage=high -O3, ensure that the main loop in the code is auto-vectorized in the best possible way for CLX and SKX nodes. The -qopenmp flag is necessary not because the code is parallelized in any way, but because the main loop is timed with an OpenMP timer. We'll want to compare the running times of vectorized and unvectorized code, so let's compile the code a second time with one extra flag:

$ icc -xCORE-AVX512 -qopt-zmm-usage=high -O3 -qopenmp -no-vec vector_ma.c -o no_vector_ma

The -no-vec option disables the vectorization that would normally occur due to the first three options.
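
Incidentally, the OpenMP timer mentioned above is simply the wall-clock function omp_get_wtime(), which is declared in omp.h and requires -qopenmp at compile time. The short, self-contained sketch below illustrates this style of timing; it is only an illustration, not the actual source of vector_ma.c:

  #include <stdio.h>
  #include <omp.h>                     /* declares omp_get_wtime(); hence -qopenmp */

  int main(void) {
      double sum = 0.0;
      double t0 = omp_get_wtime();     /* wall-clock time before the work */
      for (int i = 0; i < 100000000; i++)
          sum += 1.0e-8;               /* stand-in for the real computation */
      double t1 = omp_get_wtime();     /* wall-clock time after the work */
      printf("loop time: %f s (sum = %f)\n", t1 - t0, sum);
      return 0;
  }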

You can run the executables either by using Slurm's sbatch command to submit the following script to Frontera (with modifications if you aim to use it in the skx-dev queue on Stampede3), or by opening an interactive session on a Frontera compute node using idev (no flags) so you can enter the two commands directly into a shell there:
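The sketch below is a minimal example of such a script; the job name, queue, and time limit are reasonable defaults, and the account is a placeholder you will need to change.

#!/bin/bash
#SBATCH -J vector_ma         # job name
#SBATCH -p development       # Frontera queue (e.g., use skx-dev on Stampede3)
#SBATCH -N 1                 # one node
#SBATCH -n 1                 # one task (the code runs on a single core)
#SBATCH -t 00:05:00          # generous time limit for two short runs
#SBATCH -A myproject         # replace with an account you are enabled to use

./vector_ma
./no_vector_ma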

If you use the above batch script (which you might name run_vector_ma.sh, perhaps), remember to change the account to one that you are enabled to use, and look for the job's output in a file named slurm-######.out.

When you run the two programs, the output should indicate that the vectorized code ran approximately 6.1x faster than the unvectorized one on Frontera.

Hold on, though: weren't we expecting a 64x speedup, since a single core can in principle perform 64 vector flops per cycle? Not quite. First, vector_ma.c works with doubles, while that 64-flop figure was computed for single precision floats; this halves our expectation. Second, scalar (non-vector) code also has a fused multiply-add instruction available ("vfmadd213sd"), so FMA speeds up the scalar baseline too, halving the relative speedup again. A final 2x reduction comes from a more subtle cause, and its explanation highlights an important limit on computational performance.

Let's have a look at the main loop in the code. It's clearly simple enough to vectorize well:


  for( i = 0; i < 1000000; i++){
    ifrac = ((i+1)*2.5/1000000.0);
    #pragma vector aligned
    for( j = 0; j < N; j++){
      x[j] = ifrac*y[j] + z[j];
    }
  }

The #pragma assures the compiler that all arrays start on cache-line boundaries, which is optimal.1 The heart of the nested loops is a fused multiply-add operation; this is also optimal. What's lacking?

The additional (and non-obvious) problem is that the loop's performance is memory bound. The small arrays x, y, and z all fit into L1d cache; however, a CLX/SKX core can transfer at most 2 vector loads and 1 vector store per cycle between L1d and its registers. Each fused multiply-add in the inner loop requires exactly that much traffic: two loads (y[j] and z[j]) and one store (x[j]). The load/store ports can therefore feed only one of the two vector FMA units per cycle. This is why our expectation must be reduced by another 2x, to 8x, which is close to the observed speedup.

The bottom line is that vectorization with AVX-512 can only give you a 64x speedup if you are using single precision and if your utilization of the cache and vector registers is perfect. In particular, roughly half of the values entering your operations should be constants or other register-resident data, so that loads and stores do not become the bottleneck! For more insight, please refer to Vectorization.
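As a hypothetical illustration of that last point, suppose the kernel instead evaluated a short polynomial whose coefficients c0 through c3 are loop-invariant constants (this loop is not part of vector_ma.c):

  #pragma vector aligned
  for( j = 0; j < N; j++){
    /* three FMAs share one load of y[j] and one store of x[j];
       the coefficients c0..c3 stay in registers */
    x[j] = ((c3*y[j] + c2)*y[j] + c1)*y[j] + c0;
  }

In a loop like this, the limited load/store traffic between L1d and the registers no longer dictates the speed, and the FMA units themselves can become the bottleneck.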

Interestingly, if you compile the code instead to use the AVX2 instruction set, you see a different amount of speedup due to vectorization. Compile the code as instructed below, run the executable vector_ma_AVX2, and compare its run time to the unvectorized code in no_vector_ma.

$ icc -xCORE-AVX2 -O3 -qopenmp vector_ma.c -o vector_ma_AVX2

You should see an approximate speedup of 1.6x due to AVX2 vectorization. As described in our material on vectorization and registers, the AVX (and AVX2) instructions act on 4 double precision numbers at a time, whereas AVX-512 operates on 8 doubles at a time. We might therefore have expected a 4x speedup for AVX2, in analogy to the 8x speedup predicted for AVX-512; instead, the AVX2 result falls well below that expectation. As noted in our Guidance on Instruction Sets, the interplay between register widths, instruction sets, and clock speeds can be complicated, so the message here is that it is worthwhile to benchmark the code you are running to find the combination of options that works best for your particular application.

Extra credit

What happens if you increase N from 256 to 8192 by editing the code and recompiling? Comparing the vectorized and unvectorized results, do you see a bigger effect on the vector code or on the scalar code? (The L1d cache holds 4096 doubles in total; think about whether the lower bandwidth from L2 would affect vectors or scalars more.)

1It is justified here because the arrays are allocated with the memalign function, which guarantees such alignment. That function applies to heap memory; fixed-size arrays on the stack may instead be aligned with the alignas(64) specifier in C11 and C++11.
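For instance, a minimal sketch of the two approaches (the array size N here is arbitrary) looks like this:

  #include <stdlib.h>       /* free() */
  #include <malloc.h>       /* memalign(), a glibc extension */
  #include <stdalign.h>     /* alignas, a C11 specifier */
  #define N 256

  int main(void) {
      /* heap array starting on a 64-byte (cache-line) boundary */
      double *x = (double *) memalign( 64, N*sizeof(double) );

      /* fixed-size stack array with the same alignment */
      alignas(64) double y[N];

      y[0] = 1.0;
      x[0] = y[0];
      free(x);
      return 0;
  }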
 
©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement