This exercise will introduce using a vectorizing compiler. We will work with code containing a tight loop that should be easily vectorizable by the compiler. Our goal is to try out various compiler options and compare vectorized with non-vectorized code.

  1. Log in to a Stampede2 login node. Create a directory for code and executables for this example, or just use the top level of your home directory.
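
     For example (the username and directory name below are just placeholders):

     ```bash
     # Replace "username" with your TACC username; any working directory will do
     ssh username@stampede2.tacc.utexas.edu
     mkdir -p vector_exercise && cd vector_exercise
     ```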
  2. Obtain a copy of the example code simple2.c and batch script simple2.sh. You can use wget to download them directly to Stampede2.
  1. Feel free to review the code. It is (as advertised!) a simple program that performs a basic multiply-add of arrays many times within a loop.
  1. The Intel compiler will automatically apply vectorization at optimization level -O2 and higher. Let's try the -O3 level and inspect the vectorization report to verify that our loop has indeed been vectorized. We'll want to work with Intel compiler version 17.0.5 or above, so remember to load a suitable environment module first, if necessary.
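
     A compile line along the following lines should work; the exact module name and version depend on what is installed, while the report options shown are standard Intel compiler flags:

     ```bash
     # Load an Intel compiler module (17.0.5 or newer), if one is not already loaded
     module load intel

     # Compile at -O3 and request a level-2 vectorization report
     # (by default the report is written to simple2.optrpt)
     icc -O3 -qopt-report=2 -qopt-report-phase=vec simple2.c -o simple2
     ```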
  1. The report shows that two loops were vectorized: the loop that initializes the data values, and the main computation loop.
  1. Now that we have determined which loops are vectorizable, let's look at what happens to the report when we compile with vectorization totally disabled. We'll call the executable simple2_no_vec.
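
     For example, reusing the report flags from the previous step:

     ```bash
     # Recompile with the vectorizer disabled; the report now explains why loops were not vectorized
     icc -O3 -no-vec -qopt-report=2 -qopt-report-phase=vec simple2.c -o simple2_no_vec
     ```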
  1. At report level 2, vectorization failures are also noted. Here, the compiler informs us that the failures occurred for a pretty obvious reason, namely, the inclusion of the -no-vec flag.
  1. As mentioned earlier, the Intel compiler will use SSE (128-bit) instructions by default. That means the compiled codes we have created so far will run on either Skylake or KNL, but probably not optimally. Let's now compile the C code with vectorization enabled once again, but this time add the flags -xCORE-AVX2 and -axCORE-AVX512,MIC-AVX512 to optimize it for Stampede2. This will build the code with 256-bit AVX2 instructions as the base, while adding alternate 512-bit AVX-512 code paths for when the code runs on either SKX (CORE-AVX512) or KNL (MIC-AVX512). Name your executable simple2_avx512.
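
     For example (the report flags are again optional but convenient):

     ```bash
     # AVX2 as the baseline instruction set, plus alternate AVX-512 paths for SKX and KNL
     icc -O3 -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 \
         -qopt-report=2 -qopt-report-phase=vec simple2.c -o simple2_avx512
     ```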
  1. If you open up the vectorization report this time, you will see that it actually contains a whole sequence of reports, each of which corresponds to a different CPU identity. One of these reports may include the odd remark, "Compiler has chosen to target XMM/YMM vector," along with a suggestion to try the option -qopt-zmm-usage=high. This is what we will do next.
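
     Assuming the report again landed in the default file simple2.optrpt, a quick way to locate that remark is:

     ```bash
     # Search the report for mentions of the ZMM registers
     grep -i "zmm" simple2.optrpt
     ```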
  1. Let's try a final variant which should get rid of that odd remark in the preceding vector report. It comes from the fact that when code is auto-vectorized for Skylake, the Intel compilers will not necessarily use the 512-bit ZMM registers by default. Therefore, we apply the remedy that was discussed previously, which goes like this:
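
     (Sketch: the previous compile line with -qopt-zmm-usage=high added and a new output name.)

     ```bash
     # Encourage the compiler to use the 512-bit ZMM registers when targeting SKX
     icc -O3 -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 -qopt-zmm-usage=high \
         -qopt-report=2 -qopt-report-phase=vec simple2.c -o simple2_zmm_hi
     ```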
  1. In this case we have bestowed the fun name simple2_zmm_hi on the executable. Checking the vector report should show that any strange messages are now gone.
  1. Our goal now is to run the various compiled codes on the Xeon Phi and Xeon server CPUs to observe the impact of vectorization. The simple2.sh batch script will record the execution time for each of our vectorized and non-vectorized executables. In each case, the script pins the executable to a single core using the numactl command to reduce variations in timing.
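
     The real simple2.sh may use a different core number or additional options, but the pinning presumably resembles:

     ```bash
     # Bind the run to a single core (core 0 here) so timings are not affected by core migration
     numactl --physcpubind=0 ./simple2_avx512
     ```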
  1. Inspect the batch script, submit it as shown below, and examine the output. The -p skx-dev option overrides the script's queue selection so you can make a comparison between KNL and SKX. You may need to add a -A option to specify an account, if you have multiple accounts on Stampede2.
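
     For instance, assuming the script's own #SBATCH settings select a KNL queue:

     ```bash
     # Submit once with the script's default (KNL) queue, then again on the SKX development queue
     sbatch simple2.sh
     sbatch -p skx-dev simple2.sh

     # If you belong to more than one allocation, add -A to pick one, e.g.:
     # sbatch -A your_account -p skx-dev simple2.sh
     ```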
  1. The KNL output is displayed first. As expected, overall times are longer on KNL than on SKX, partly because the base clock speed of SKX is faster. But the speed advantage of SKX is made even greater by Turbo Boost, which pushes the expected ratio of clock rates to at least 2.2x (3.5 GHz vs. 1.6 GHz for AVX-512, 1 core).
  1. Discussion. With either chip, the speedup observed in going from simple2_no_vec to simple2_zmm_hi—i.e., from unvectorized code to AVX-512—is 10-11x, significantly more than the 8x that one might predict based on vector size alone. The excess speedup is mostly achieved in the first step to SSE: instead of the anticipated 2x (128 bits = 2 doubles), the speedup factor is more like 2.7x. The further gain in going from SSE to AVX-512 looks to be more in line with the 4x that one expects based just on vector size. On KNL, measurements yield 0.89/0.22 = 4.0x, while on SKX, it is 0.21/0.09 = 2.3x until -qopt-zmm-usage=high is activated, when it improves to 0.21/0.05 = 4.2x.
  1. However, one should remember that in going from SSE to AVX2 or AVX-512, a further 2x is possible due to fused multiply-add (FMA) instructions. Perhaps SSE is able to do the equivalent of an FMA through hardware pipelining, which would explain why SSE performs so much better than expected. Detailed profiling and performance modeling of simple2.c would be required to understand the reasons for the various speedups observed on KNL and SKX.
  1. Finally, all the tests in this exercise involve only one thread of execution. For architectures such as MIC, though, a single thread represents only a fraction of the computing throughput that is available. Nothing prevents us from performing similar vector speedup tests with codes that have been additionally parallelized for multiple cores via OpenMP, e.g., to obtain results that would be more relevant to typical applications. (The implications of Turbo Boost become more complicated in that case.)
  1. Extra credit. What happens if we introduce a dependency into the initial loop? We expect that the compiler will refuse to vectorize it, but let's find out for sure via the vectorization report. Copy the original code into simple2_dep.c and edit the first loop so that it carries a dependency from one iteration to the next (for instance, by making each element depend on the element computed in the previous iteration).
  1. Compile the modified code in the same way as before and inspect the vectorization report:
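
     Assuming the same flags as the final build above, the commands might look like this:

     ```bash
     # Build the modified source and view its vectorization report
     icc -O3 -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 -qopt-zmm-usage=high \
         -qopt-report=2 -qopt-report-phase=vec simple2_dep.c -o simple2_dep
     cat simple2_dep.optrpt
     ```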
  1. You should see clear evidence that the compiler detected the dependency and declined to vectorize the bulk of the loop.
 