For the Intel compilers, automatic vectorization occurs at an optimization level of -O2 or higher; for GCC, it occurs with -O2 -ftree-vectorize or simply -O3. These flags alone, however, may not be sufficient for getting the very best results. By default, the Intel and GNU compilers (all versions) produce SSE instructions, because SSE is nearly universal across processor types. Today, however, the 128-bit vector length specified by SSE is often less than optimal. The developer must therefore give additional options to the compiler to generate binary code that is suited to a more recent target architecture.
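As a minimal sketch of these baseline settings, the shell functions below wrap the corresponding compile commands. Only the optimization flags come from the text; the function names, the choice of the C drivers icc and gcc, and the file arguments are illustrative assumptions.

```shell
# Baseline auto-vectorizing builds; $1 is the source file, $2 the output name.

# Intel: vectorization is already enabled at -O2.
build_intel_o2() { icc -O2 "$1" -o "$2"; }

# GCC: request the vectorizer explicitly at -O2 ...
build_gnu_o2_vec() { gcc -O2 -ftree-vectorize "$1" -o "$2"; }

# ... or rely on -O3, which enables it.
build_gnu_o3() { gcc -O3 "$1" -o "$2"; }
```

For example, build_gnu_o3 mycode.c mycode (with a hypothetical mycode.c) would give an executable that still targets only the compiler's default instruction set.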

For example, the ICX, SKX, and KNL processors on Stampede2 support variants of the AVX-512 instruction set, which operates on 512-bit vectors. The default 128-bit SSE instructions would therefore use at most a quarter of the vector width available on the Stampede2 CPUs. Moreover, on KNL the older-style x86 vector instructions would run on only 1 of the 2 VPUs per core. Giving the right flags to the compiler is therefore crucial for getting the best performance from Stampede2.
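The size of the gap can be checked with simple arithmetic: the number of elements (lanes) handled by one vector instruction is the register width divided by the element width. The helper name lanes below is hypothetical.

```shell
# Lanes per vector instruction = register width (bits) / element width (bits).
lanes() { echo $(( $1 / $2 )); }

lanes 128 64   # SSE, double precision:     prints 2
lanes 512 64   # AVX-512, double precision: prints 8
lanes 512 32   # AVX-512, single precision: prints 16
```

In double precision, then, an SSE instruction handles 2 values where an AVX-512 instruction handles 8.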

What are the right options to choose? (Note: in what follows, we assume the application has no real need of the extra instructions available in ICX.) The Intel compiler options that ought to work best on Stampede2 are described below.

Specifying the -xCORE-AVX512 flag instructs the compiler to draw from the whole suite of instructions available for SKX processors, including ones for 512-bit-wide vectors. The resulting binaries, however, are not runnable on KNL or on any other hardware that does not support the exact instruction set implied by that particular flag. For KNL, the best match is instead the -xMIC-AVX512 flag; but again, the use of that flag precludes running the resulting binaries on SKX or other processor types.
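A sketch of the two single-architecture builds might look as follows. It assumes a C code compiled with the icc driver at -O3; the optimization level, function names, and file names are illustrative and not taken from the text.

```shell
# One executable per node type; $1 is the source file, $2 the output name.

# SKX only: the result will not run on KNL.
build_skx() { icc -O3 -xCORE-AVX512 "$1" -o "$2"; }

# KNL only: the result will not run on SKX.
build_knl() { icc -O3 -xMIC-AVX512 "$1" -o "$2"; }
```

Calling build_skx mycode.c mycode.skx and build_knl mycode.c mycode.knl would then give two separate executables, one for each type of node.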

Is there a way to cover both types of Stampede2 nodes at once? The answer is yes. Combine a baseline -x flag with the -ax flag, as in -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512, to enable the automatic dispatching of instructions depending on the CPU hardware.

This kind of executable is known as a "fat binary" because it contains instructions for multiple architectures, and it has the ability to switch among them depending on the type of hardware that is detected at run time. In the above case, the default architecture is AVX2, with CORE-AVX512 and MIC-AVX512 being valid alternatives.
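Under the same kind of assumptions (C source, the icc driver, -O3, hypothetical names), a fat-binary build with an AVX2 baseline and the two AVX-512 alternatives might be wrapped like this:

```shell
# Fat binary: AVX2 baseline (-x) plus alternative code paths (-ax)
# that are selected at run time on SKX or KNL.
build_fat() { icc -O3 -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 "$1" -o "$2"; }
```

A single call such as build_fat mycode.c mycode should then produce one executable that can run on both node types.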

For Intel 17.0.5 and above, a variant of the above compilation may work better for numerically intensive HPC codes. The Intel compiler doesn't make the most aggressive assumptions when vectorizing codes for Skylake, because SKX is a general-purpose processor that may encounter a wide variety of workloads. The compiler may therefore need extra prodding to include instructions for the 512-bit ZMM registers on SKX: switch the base architecture to -xCOMMON-AVX512 and add the option -qopt-zmm-usage=high.
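A sketch of this variant follows. The base architecture and the -qopt-zmm-usage=high option are the ones named in the surrounding text; keeping the same -ax list as in the earlier fat-binary build is an assumption, as are the icc driver, -O3, and the function name.

```shell
# Fat binary with an AVX-512 baseline and a request for heavy ZMM use.
build_fat_zmm() {
    icc -O3 -xCOMMON-AVX512 -axCORE-AVX512,MIC-AVX512 \
        -qopt-zmm-usage=high "$1" -o "$2"
}
```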

Note that choosing -xCOMMON-AVX512 as the base architecture is a prerequisite for activating the option -qopt-zmm-usage=high, which tells the compiler to use 512-bit vector instructions as much as possible. (The default is low on SKX, high on KNL.) This flag may or may not make a difference in performance on SKX, but including it can sometimes be advantageous. If the code is to be run on Skylakes only, it can also be compiled this way:
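One plausible form of such a Skylake-only build is sketched below. It assumes that -qopt-zmm-usage=high may also be combined with the SKX-specific flag -xCORE-AVX512; the icc driver, -O3, and the function name are likewise illustrative.

```shell
# SKX-only executable that favors 512-bit ZMM registers.
build_skx_zmm() { icc -O3 -xCORE-AVX512 -qopt-zmm-usage=high "$1" -o "$2"; }
```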

When the login nodes are architecturally identical to the compute nodes, a typical shortcut is to use -xHost, which detects the important features of the host hardware automatically (AVX2 or AVX-512, etc.) and compiles for exactly those features. However, -qopt-zmm-usage=high is an example of an architectural option that isn't captured via -xHost.
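If -xHost is used on a Skylake login node, the ZMM preference would therefore still have to be requested explicitly. A minimal sketch, again assuming icc, -O3, and hypothetical names:

```shell
# -xHost picks the instruction set of the compiling machine;
# the ZMM preference is added by hand because -xHost does not imply it.
build_host_zmm() { icc -O3 -xHost -qopt-zmm-usage=high "$1" -o "$2"; }
```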

If automatic vectorization is not desired, it can be disabled at any optimization level by specifying -no-vec. The -no-vec option can be useful for determining vector speedup, as well as for profiling purposes, as we shall see later.
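To estimate vector speedup, one can build the same source twice, with and without vectorization, and compare run times. The sketch below assumes icc, -O3, and an SKX target; the function names and the speedup helper are hypothetical.

```shell
# Same flags in both builds except for -no-vec.
build_vec()   { icc -O3 -xCORE-AVX512 "$1" -o "$2"; }
build_novec() { icc -O3 -xCORE-AVX512 -no-vec "$1" -o "$2"; }

# Ratio of two wall-clock times in seconds (unvectorized, vectorized),
# printed to two decimal places.
speedup() { awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'; }
```

After timing the two executables (for example with the shell's time keyword), a call such as speedup 12.0 4.0 would print 3.00.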

For the GNU compilers, the architecture-specific options for KNL, SKX, and ICX are -march=knl, -march=skylake-avx512, and -march=icelake-server, respectively, while the -march=native flag produces a binary suitable for the machine on which the source is compiled. These GCC options work for the Intel compilers as well. The GCC-only option to give preference to AVX-512 instructions is -mprefer-vector-width=512, while the option for disabling vectorization altogether is -fno-tree-vectorize (which must come after -O3 on the command line).
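The GCC counterparts can be sketched in the same way. The flags are those listed above; the gcc driver for C, -O3, and the function names are assumptions made for illustration.

```shell
# SKX target, preferring 512-bit vectors.
build_gnu_skx() { gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 "$1" -o "$2"; }

# KNL target.
build_gnu_knl() { gcc -O3 -march=knl "$1" -o "$2"; }

# Target whatever machine is doing the compiling.
build_gnu_native() { gcc -O3 -march=native "$1" -o "$2"; }

# Vectorization disabled; the flag follows -O3 so that it takes effect.
build_gnu_novec() { gcc -O3 -fno-tree-vectorize -march=skylake-avx512 "$1" -o "$2"; }
```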

Important note: For code with function calls, the compiler may be able to achieve better optimization, including vectorization, if interprocedural optimization is in effect. Enable it using -ipo for the Intel compilers or -flto (for "link time optimization") for GCC. This allows the compiler/linker to inline and vectorize a function that is called in a loop in a different source file, for example.
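As a sketch, interprocedural optimization across two source files might be requested as follows. The split into a main program and a separately compiled kernel, the drivers, -O3, the object file names, and the function names are all hypothetical. For GCC, -flto is given at both the compile and the link step.

```shell
# Intel: -ipo on a multi-file build; $1 and $2 are sources, $3 the output.
build_intel_ipo() { icc -O3 -ipo "$1" "$2" -o "$3"; }

# GCC: -flto when compiling each file and again when linking.
build_gnu_lto() {
    gcc -O3 -flto -c "$1" -o main.o
    gcc -O3 -flto -c "$2" -o kernel.o
    gcc -O3 -flto main.o kernel.o -o "$3"
}
```

With either function, a call such as build_gnu_lto main.c kernel.c app gives the optimizer a chance to inline a routine from kernel.c into a loop in main.c.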

 
© Cornell University | Center for Advanced Computing