Exercise
This exercise introduces the use of a vectorizing compiler. We will work with code containing a tight loop that should be easily vectorizable by the compiler. Our goal is to try out various compiler options and compare vectorized with non-vectorized code.
The steps in this exercise have been tested on Stampede3 and should work there, though the timing results on a Skylake (SKX) node may not turn out exactly the same. The steps should also work, with some adjustment, on a different system or processor type.
- Log in to a login node. Create a directory for code and executables for this exercise, or just use the top level of your home directory.
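If you prefer a dedicated directory, something like the following would do (the directory name here is just a hypothetical choice):

```
# create and enter a working directory for this exercise (name is arbitrary)
mkdir -p ~/vec_exercise
cd ~/vec_exercise
```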
- Obtain a copy of the example code simple2.c and batch script simple2.sh. You can use wget to download them directly.
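The download might look roughly like this; the URL below is only a placeholder, so substitute the actual address provided for the exercise:

```
# placeholder URL -- replace with the real location of the exercise files
wget https://example.org/path/simple2.c
wget https://example.org/path/simple2.sh
```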
- Feel free to review the code. It is (as advertised!) a simple program that performs a basic multiply-add of arrays many times within a loop. Notice that the final result must be printed out, because if it is not, the bulk of the code may be "optimized away" to do nothing at all (and do it very quickly!).
- The Intel compiler will automatically apply vectorization at optimization level -O2 and higher. Let's try the -O3 level and inspect the vectorization report to verify that our loop has indeed been vectorized. The commands should work with icx, version 25.1 or above, and likewise with the Intel classic compiler icc, version 17.0.5 or above. Remember to load a suitable environment module first, if that is needed. (Versions of icx prior to 25.1 may not support the -qopt-report-phase option; it can be omitted, if that's the case.)
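A compile command along the following lines should work, assuming the flags discussed above (substitute icc for icx if you are using the classic compiler, and drop -qopt-report-phase for older icx versions):

```
# -O3 enables auto-vectorization; the report flags request a level-2 vectorization report
icx -O3 -qopt-report=2 -qopt-report-phase=vec simple2.c -o simple2
```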
- Report level 2 records the vectorizer's failures as well as its successes. Here, the report is showing us that the loop that initializes the data values was not vectorized, while the main computation loop was vectorized.
- Now that we have determined which loops are vectorizable, let's look at what happens to the report when we compile with vectorization totally disabled. We'll call the executable simple2_no_vec.
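A sketch of the corresponding command, keeping the same report flags so the output can be compared:

```
# -no-vec turns off auto-vectorization for this build
icx -O3 -no-vec -qopt-report=2 -qopt-report-phase=vec simple2.c -o simple2_no_vec
```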
- This time, the compiler gives no indication of loops being vectorized. The obvious reason is the -no-vec flag.
- As mentioned earlier, the Intel compiler will use SSE (128-bit) instructions by default. That means the compiled codes we have created so far will run on AVX2 and AVX-512 processors, but probably not optimally. Let's now compile the C code with vectorization enabled once again, but this time add the flag -xCORE-AVX512 to optimize it for more recent processor types. This will build the code with 512-bit AVX instructions for SKX and subsequent generations. We'll name the executable simple2_avx512.
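Assuming the same reporting flags as before, the command would look something like this:

```
# -xCORE-AVX512 targets SKX and later processor generations
icx -O3 -xCORE-AVX512 -qopt-report=2 -qopt-report-phase=vec simple2.c -o simple2_avx512
```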
- The vectorization report now shows that the main loop was vectorized using a vector length of 4 doubles (256 bits). However, AVX-512 corresponds to a vector length of 8. So why wasn't 8 used? It turns out that when the Intel compilers auto-vectorize code for Skylake, they will not necessarily use the 512-bit ZMM registers by default, because doing so might not improve performance.
- Let's try a final variant to address what we learned from the preceding vector report. We'll apply the remedy that was discussed previously, namely, adding the option -qopt-zmm-usage=high. The "high" setting overrides the compiler's preferences and ensures that the 512-bit ZMM registers will be used.
- In this case we have bestowed the fun name simple2_zmm_hi on the executable. Checking the vector report should show that the vector length is now 8.
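The command is the same as the previous one, with the extra ZMM option added; roughly:

```
# -qopt-zmm-usage=high ensures the 512-bit ZMM registers are actually used
icx -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qopt-report=2 -qopt-report-phase=vec simple2.c -o simple2_zmm_hi
```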
- Our next goal will be to run the various compiled codes on an Intel Xeon AVX-512 server via Slurm, to observe the impact of vectorization. The simple2.sh batch script will record the execution time for each of our vectorized and non-vectorized executables. In each case, the script pins the executable to a single core using the numactl command to reduce variations in timing.
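Inside the script, each timed run is pinned with numactl; a representative line (the core number and exact form are illustrative, not necessarily identical to what simple2.sh contains) would be:

```
# bind the run to a single core (core 0 here) to reduce timing variation
numactl --physcpubind=0 ./simple2
```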
- You should inspect the Slurm batch script and change the queue name as necessary. You may need to add a -A option to specify an account, if you have multiple accounts. Submit it as shown below and examine the output.
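Submission and inspection should look something like the following (the output file name assumes Slurm's default slurm-&lt;jobid&gt;.out pattern and may differ on your system):

```
# submit the job; add -A <account_name> if you have multiple allocations
sbatch simple2.sh
# once the job finishes, look at the timings in the output file
cat slurm-*.out
```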
- Surprisingly, on a single SKX core, the unvectorized code runs almost twice as fast as the SSE-vectorized code! This is not simply a consequence of Turbo Boost, because the latter pushes up the single-core clock rate to 3.5-3.7 GHz for all vector types. It's more likely that with vectorization disabled, the compiler is able to do different code transformations at the -O3 optimization level which happen to outperform SSE on the available hardware. For example, it may be that Skylake's capabilities with pipelining and instruction-level parallelism are more crucial to performance than SSE vectorization.
- Discussion. The speedup observed in going from simple2 to simple2_zmm_hi (i.e., from SSE to AVX-512) is 7x, significantly more than the 4x that one might predict based on vector size alone (128 bits versus 512 bits). However, one should remember that in taking the first step from SSE to AVX2, a further 2x becomes available due to fused multiply-add (FMA) instructions, so the ideal expectation is closer to 8x. And indeed, we do see roughly 4x when moving from SSE to AVX2 (simple2_avx512, effectively AVX2 on Skylake), followed by a further 2x in going from AVX2 to full AVX-512 (simple2_zmm_hi).
- However, this does not tell us why disabling vectorization results in a nearly 2x speedup over SSE on Skylake. Perhaps -no-vec is able to unlock the equivalent of two FMAs per cycle, through hardware pipelining? SSE can't match that, as it lacks an FMA instruction. Detailed profiling and performance modeling of simple2.c would be required to understand the reasons for the various speedups observed on SKX.
- Finally, all the tests in this exercise involve only one thread of execution. For modern multicore architectures, a single thread represents only a fraction of the available computing throughput. Nothing prevents us from performing similar vector speedup tests with codes that have additionally been parallelized for multiple cores via OpenMP, for example, in order to obtain results that are more relevant to typical applications. (The implications of Turbo Boost become more complicated in that case.)
- Extra credit. What happens if we introduce a dependency into the initial loop? We expect that the compiler will refuse to vectorize it, but let's find out for sure via the vectorization report. Copy the original code into simple2_dep.c and edit it so that the main loops look like this:
- Compile the modified code in the same way as before and inspect the vectorization report:
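Assuming the same flags as in the first compilation step (the executable name simple2_dep is just a suggestion):

```
# recompile the modified source; the report will explain any loops left unvectorized
icx -O3 -qopt-report=2 -qopt-report-phase=vec simple2_dep.c -o simple2_dep
```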
- You should see clear evidence that the compiler detected the dependency and declined to vectorize the bulk of the loop.
© Cornell University | Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)