This exercise will introduce using a vectorizing compiler. We will work with code containing a tight loop that should be easily vectorizable by the compiler. Our goal is to try out various compiler options and compare vectorized with non-vectorized code.

The steps in this exercise have been tested on Stampede3 and should work there, though the timing results on a Skylake (SKX) node may not turn out exactly the same. The steps should also work, with some adjustment, on a different system or processor type.

  1. Log in to a login node. Create a directory for code and executables for this exercise, or just use the top level of your home directory.
  2. Obtain a copy of the example code simple2.c and batch script simple2.sh. You can use wget to download them directly.
  1. Feel free to review the code. It is (as advertised!) a simple program that performs a basic multiply-add of arrays many times within a loop. Notice that the final result must be printed out, because if it is not, the bulk of the code may be "optimized away" to do nothing at all (and do it very quickly!).
  1. The Intel compiler will automatically apply vectorization at optimization level -O2 and higher. Let's try the -O3 level and inspect the vectorization report to verify that our loop has indeed been vectorized. The commands should work with icx, version 25.1 or above, and likewise with the Intel classic compiler icc, version 17.0.5 or above. Remember to load a suitable environment module first, if that is needed. (Versions of icx prior to 25.1 may not support the -qopt-report-phase option; it can be omitted, if that's the case.)
  1. Report level 2 records the vectorizer's failures as well as its successes. Here, the report is showing us that the loop that initializes the data values was not vectorized, while the main computation loop was vectorized.
  1. Now that we have determined which loops are vectorizable, let's look at what happens to the report when we compile with vectorization totally disabled. We'll call the executable simple2_no_vec.
  1. This time, the compiler gives no indication of loops being vectorized. The obvious reason is the -no-vec flag.
  1. As mentioned earlier, the Intel compiler will use SSE (128-bit) instructions by default. That means the compiled codes we have created so far will run on AVX2 and AVX-512 processors, but probably not optimally. Let's now compile the C code with vectorization enabled once again, but this time add the flag -xCORE-AVX512 to optimize it for more recent processor types. This builds the code with the AVX-512 instruction set used by SKX and subsequent generations. We'll name the executable simple2_avx512.
  1. The vectorization report now shows that the main loop was vectorized using a vector length of 4 doubles (256 bits). However, AVX-512 corresponds to a vector length of 8. So why wasn't 8 used? It turns out that when the Intel compilers auto-vectorize code for Skylake, they will not necessarily use the 512-bit ZMM registers by default: heavy use of 512-bit instructions can lower the core's clock frequency, so it does not always improve overall performance.
  1. Let's try a final variant to address what we learned from the preceding vector report. We'll apply the remedy that was discussed previously, namely, adding the option -qopt-zmm-usage=high. The "high" setting overrides the compiler's preferences and ensures that the 512-bit ZMM registers will be used.
  1. In this case we have bestowed the fun name simple2_zmm_hi on the executable. Checking the vector report should show the vector length is now 8.
  1. Our next goal will be to run the various compiled codes on an Intel Xeon AVX-512 server via Slurm, to observe the impact of vectorization. The simple2.sh batch script will record the execution time for each of our vectorized and non-vectorized executables. In each case, the script pins the executable to a single core using the numactl command to reduce variations in timing.
  1. You should inspect the Slurm batch script and change the queue name as necessary. You may need to add a -A option to specify an account, if you have multiple accounts. Submit the job with sbatch and examine the output.
  1. Surprisingly, on a single SKX core, the unvectorized code runs almost twice as fast as the SSE-vectorized code! This is not simply a consequence of Turbo Boost, because the latter pushes up the single-core clock rate to 3.5-3.7 GHz for all vector types. It is more likely that with vectorization disabled, the compiler can apply different code transformations at the -O3 optimization level that happen to outperform SSE on the available hardware. For example, it may be that Skylake's capabilities for pipelining and instruction-level parallelism matter more to performance here than SSE vectorization.
  1. Discussion. The speedup observed in going from simple2 to simple2_zmm_hi—i.e., from SSE to AVX-512—is 7x, significantly more than the 4x that one might predict based on vector size alone. However, one should remember that in taking the first step from SSE to AVX2, a further 2x becomes available due to fused multiply-add (FMA) instructions. And indeed, we do see roughly 4x when moving from SSE to AVX2 (simple2_avx512, effectively AVX2 on Skylake), followed by a further 2x in going from AVX2 to full AVX-512 (simple2_zmm_hi).
  1. However, this does not tell us how disabling vectorization results in a nearly 2x speedup over SSE on Skylake. Perhaps -no-vec is able to unlock the equivalent of two FMAs per cycle, through hardware pipelining? SSE can't match that, as it lacks an FMA instruction. Detailed profiling and performance modeling of simple2.c would be required to understand the reasons for the various speedups observed on SKX.
  1. Finally, all the tests in this exercise involve only one thread of execution. For modern multicore architectures, a single thread represents only a fraction of the computing throughput that is available. Nothing prevents us from performing similar vector speedup tests with codes that have been additionally parallelized for multiple cores via OpenMP, e.g., to obtain results that would be more relevant to typical applications. (The implications of Turbo Boost become more complicated in that case.)
  1. Extra credit. What happens if we introduce a dependency into the initial loop? We expect that the compiler will refuse to vectorize it, but let's find out for sure via the vectorization report. Copy the original code into simple2_dep.c and edit it so that each iteration of the main computation loop depends on a result computed in the previous iteration.
  1. Compile the modified code in the same way as before and inspect the vectorization report.
  1. You should see clear evidence that the compiler detected the dependency and declined to vectorize the bulk of the loop.
© | Cornell University | Center for Advanced Computing | Copyright Statement | Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)