For most compilers, automatic vectorization occurs at an optimization level of -O2 or higher; for GCC prior to gcc-12, it requires -O2 -ftree-vectorize or -O3. But these flags alone may not be sufficient for the very best results. The Intel and GNU compilers currently produce SSE instructions by default, because SSE is nearly universal across different processor types. Today, however, the 128-bit vector width specified by SSE is often far from optimal. The developer must therefore give additional options to the compiler to generate binary code suitable for a more recent target architecture.

For example, modern Intel processors support variants of the AVX-512 instruction set, which operates on 512-bit vectors. The default SSE instructions would use only a fraction of the vector processing capability of these Intel CPUs. Giving the right flags to the compiler is therefore crucial for getting the best performance from applications that can make use of vector processing units.

What are the right options to choose? Here are the compiler options that often work best on the Skylake or SKX nodes of Stampede3:
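A plausible form of the command (the source and executable names are illustrative, and icx, the Intel oneAPI C compiler, is assumed to be available):

```shell
# Target the full SKX instruction set, including 512-bit AVX-512
icx -O3 -xCORE-AVX512 -o mycode mycode.c
```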

Specifying the -xCORE-AVX512 flag instructs the compiler to draw from the whole suite of instructions available for SKX processors, including ones for 512-bit-wide vectors. The resulting binary, however, is not runnable on any older hardware that does not support the exact instruction set implied by that particular flag.

It turns out that the Intel compiler might not make the most aggressive assumptions when vectorizing code for SKX, because SKX is a general-purpose processor that may encounter a wide variety of workloads. The compiler may therefore require extra prodding to include instructions for the 512-bit ZMM registers on SKX, as follows:
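One plausible form of the command, with the extra flag added to the SKX compilation (file names illustrative; icx assumed):

```shell
# Encourage the compiler to make full use of the 512-bit ZMM registers
icx -O3 -xCORE-AVX512 -qopt-zmm-usage=high -o mycode mycode.c
```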

This flag may or may not make a difference in performance on SKX, but including it can sometimes be advantageous.

When the login nodes are architecturally identical to the compute nodes, a typical shortcut is to use -xHost, which detects the important features of the host hardware automatically (AVX2, AVX-512, etc.) and compiles for exactly those features. Note, however, that -qopt-zmm-usage=high is an example of an architectural option that is not captured by -xHost.
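As a sketch (file names illustrative), compiling on a login node for architecturally identical compute nodes could then be as simple as:

```shell
# Detect and target the features of the CPU doing the compiling;
# note that -qopt-zmm-usage=high would still need to be given explicitly
icx -O3 -xHost -o mycode mycode.c
```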

If automatic vectorization is not desired, it can be disabled at any optimization level by specifying -no-vec. The -no-vec option can be useful for determining vector speedup, as well as for profiling purposes, as we shall see later.
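A sketch of how -no-vec can be used to measure vector speedup (file names illustrative; icx assumed):

```shell
# Build the same code twice: once vectorized, once with vectorization disabled
icx -O3 -xCORE-AVX512 -o mycode_vec mycode.c
icx -O3 -xCORE-AVX512 -no-vec -o mycode_novec mycode.c
# Timing the two binaries on identical input estimates the vector speedup
```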

For the GNU compilers, the architecture-specific options for SKX and ICX are -march=skylake-avx512 and -march=icelake-server, respectively, while the -march=native flag produces a binary suited to the machine on which the source is compiled. These GCC options work for the Intel compilers as well. The GCC-only option that gives preference to 512-bit AVX-512 instructions is -mprefer-vector-width=512, while the option for disabling vectorization altogether is -fno-tree-vectorize (it must come after -O3 on the command line).

Important note: For code with function calls, the compiler may be able to achieve better optimization, including vectorization, if interprocedural optimization is in effect. Enable it using -ipo for the Intel compilers or -flto (for "link time optimization") for GCC. This allows the compiler/linker to inline and vectorize a function that is called in a loop in a different source file, for example.

Historical notes: The former Stampede2 included KNL nodes, for which the correct architecture flag was -xMIC-AVX512 in Intel's former "classic compiler" icc. However, the use of that flag precluded running the resulting binaries on SKX or other processor types. Is there a way to cover both types of nodes at once? The answer is yes. With Intel compilers, one can use the -ax flag to enable the automatic dispatching of instructions depending on the CPU hardware:
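A command of roughly this form (file names illustrative) matches the defaults described in the next paragraph:

```shell
# Baseline CORE-AVX2 code path, plus automatically dispatched
# alternate paths for CORE-AVX512 (SKX) and MIC-AVX512 (KNL)
icc -O3 -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 -o mycode mycode.c
```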

This kind of executable is known as a "fat binary" because it contains instructions for multiple architectures, and it has the ability to switch among them depending on the type of hardware that is detected at run time. In the above case, the default architecture is CORE-AVX2, with CORE-AVX512 and MIC-AVX512 being valid alternatives.

If one added a flag to create a vectorization report to the above compilation, one would actually obtain a whole sequence of reports, each corresponding to a different CPU identity. One of these reports might include the odd remark, "Compiler has chosen to target XMM/YMM vector," along with a suggestion to try the option -qopt-zmm-usage=high. Can a "fat binary" include that option, too? The answer is again yes, but a different default architecture must be specified.

For icc 17.0.5 and above, the following variant of the above compilation could work better for a numerically intensive HPC code, in case it might need to run on any of the Stampede2 compute nodes:
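A plausible form of that variant, based on the base architecture named in the next paragraph (file names illustrative):

```shell
# The COMMON-AVX512 baseline permits -qopt-zmm-usage=high, while -ax adds
# CPU-dispatched paths for the SKX and KNL nodes of Stampede2
icc -O3 -xCOMMON-AVX512 -axCORE-AVX512,MIC-AVX512 -qopt-zmm-usage=high -o mycode mycode.c
```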

It turns out that choosing -xCOMMON-AVX512 as the base architecture is a prerequisite for activating the option -qopt-zmm-usage=high, which tells the compiler to use 512-bit vector instructions as much as possible. (The default is low on SKX, high on KNL.)

In icx and the other Intel oneAPI compilers, the MIC-AVX512 architecture is no longer available. Most options from the Intel classic compilers, including -ax, are still supported, though -ax seems less effective than it was in icc.

 
© Cornell University | Center for Advanced Computing | Copyright Statement | Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)