Guidance on Instruction Sets
We highlight here some compilation options that select various instruction sets and how these options can affect code performance. These issues, along with some other factors relevant for performance, are described more extensively by TACC in the User Guides for Stampede3 and Frontera:
- Performance (Stampede3)
- Programming and Performance: CLX (Frontera)
Both guides linked above address several details related to instruction sets:
- Clock Speed: Although the SKX nodes on Stampede3 have a nominal clock speed of 2.1 GHz, and the CLX chips on Frontera a speed of 2.7 GHz, actual clock speed can vary widely in each case, depending on the vector instruction set used, the number of active cores, and other factors affecting power requirements and temperature limits.
- Vector Optimization and AVX2: Because clock speed can vary with vector instruction set, some applications might run faster using the 256-bit AVX2 instruction set rather than the AVX-512 set.
- Vector Optimization and 512-Bit ZMM Registers: The -qopt-zmm-usage option affects whether the compiler vectorizes a loop with AVX-512 instructions (512-bit registers, -qopt-zmm-usage=high) or AVX2 instructions (256-bit registers, -qopt-zmm-usage=low). Even though the -xCORE-AVX512 flag targets the AVX-512 instruction set, the default zmm-usage for that architecture target is low. You might therefore want to set -qopt-zmm-usage=high explicitly on SKX/CLX to see whether your code can make more efficient use of the wide 512-bit registers.
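As a rough illustration of the kind of loop these options act on, here is a minimal C sketch. The function name saxpy and the file name saxpy.c are assumptions made for this example and do not come from the TACC guides. The compile lines in the comment assume the Intel compiler is invoked as icx.

```c
#include <stddef.h>

/*
 * Hypothetical example file: saxpy.c
 *
 * Possible builds to compare (compiler name icx is an assumption):
 *   icx -O3 -xCORE-AVX512 -qopt-zmm-usage=low  -c saxpy.c
 *   icx -O3 -xCORE-AVX512 -qopt-zmm-usage=high -c saxpy.c
 *
 * Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * The "restrict" qualifiers promise that x and y do not overlap,
 * which removes one common obstacle to automatic vectorization.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the high setting actually runs faster depends on the loop and on the clock-speed effects described above, so it is worth timing both builds rather than assuming that wider registers win.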
We will see some of the implications of these issues in subsequent exercises.
The full "AVX-512 instruction set" is actually a collection of subsets of instructions, many of which are valid on both SKX and CLX, but some only on CLX, and some only on Intel's former KNL processors. KNL turns out to be a bit of an outlier, in that only a small slice of AVX-512 is common to KNL and more recent Xeons. The different subsets of AVX-512 instructions are summarized in the table below, which is updated from a similar table in our companion material on vectorization.
Extension | CLX | SKX | KNL | Functionality |
---|---|---|---|---|
AVX-512F | X | X | X | Foundation: expands upon AVX to support 512-bit registers; adds masked operations and other new features. |
AVX-512CD | X | X | X | Conflict Detection: permits the vectorization of loops with certain kinds of write conflicts. |
AVX-512BW | X | X | | Byte and Word: adds support for vectors comprised of bytes, or of 8- or 16-bit integers; allows masked operations. |
AVX-512DQ | X | X | | Doubleword and Quadword: adds instructions for vectors of 32- or 64-bit integers; allows masked operations. |
AVX-512VL | X | X | | Vector Length: enables AVX-512 to work with up to 32 of the smaller-size SSE or AVX registers; allows masked operations. |
AVX-512PF | | | X | Prefetch: adds prefetch operations for the gather and scatter functionality introduced in AVX2 and AVX-512. |
AVX-512ER | | | X | Exponential and Reciprocal: includes new operations for 2^x exponentials, reciprocals, and reciprocal square roots. |
AVX-512VNNI | X | | | Vector Neural Network Instructions: adds new fused operations on 8- or 16-bit integers for the inner convolution loop often encountered in deep learning neural network computations. |
The Intel compilers bundle these subsets in different ways, depending on which target architecture is selected with the -x flag. For example, -xCORE-AVX512 generates instructions for both SKX and CLX, based on the subsets of instructions that they share. The GNU compilers provide an analogous -march flag, but they also let you enable specific instruction subsets individually, if that is of interest. For SKX processors, the instruction subsets are: Foundation, Conflict Detection, Byte and Word, Doubleword and Quadword, and Vector Length. So instead of using -march=skylake-avx512, those subsets could be enabled in a call to the GNU compilers as follows:
$ gcc -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -O3 -fopenmp omp_hello.c -o omp_hello
$ ### OR ###
$ gfortran -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -O3 -fopenmp omp_hello.f90 -o omp_hello
Again, it is important to recognize that most of the AVX-512 instructions pertain to vectorizable code. We will return to this point later.
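One way to confirm which AVX-512 subsets a given set of flags has enabled is to test the macros that the compiler predefines for each subset. The sketch below assumes the usual macro names (__AVX512F__, __AVX512CD__, and so on) that GCC, Clang, and the Intel compilers define. The function name avx512_subsets_enabled is hypothetical and chosen only for this example.

```c
#include <stdio.h>

/*
 * Print and count the AVX-512 subsets that were enabled at compile time.
 * This reflects only the compiler flags (e.g., -mavx512f or -march=...),
 * not the capabilities of the CPU on which the program eventually runs.
 */
int avx512_subsets_enabled(void)
{
    int count = 0;
#ifdef __AVX512F__
    puts("AVX-512F");
    count++;
#endif
#ifdef __AVX512CD__
    puts("AVX-512CD");
    count++;
#endif
#ifdef __AVX512BW__
    puts("AVX-512BW");
    count++;
#endif
#ifdef __AVX512DQ__
    puts("AVX-512DQ");
    count++;
#endif
#ifdef __AVX512VL__
    puts("AVX-512VL");
    count++;
#endif
#ifdef __AVX512PF__
    puts("AVX-512PF");
    count++;
#endif
#ifdef __AVX512ER__
    puts("AVX-512ER");
    count++;
#endif
#ifdef __AVX512VNNI__
    puts("AVX-512VNNI");
    count++;
#endif
    return count;
}
```

Calling this function from a program built with the five -m flags shown above should list the five subsets named for SKX, while a build with no architecture flags would typically list none.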