Special Compiler Options
We turn now to options that go beyond the standard "-O" selections. They enable the compiler to do tricks like using specialized hardware instructions that can offer significant performance gains.
By adding just an extra flag or two, you can let the compiler know it should try to optimize the code for a particular model of processor. It is especially easy to do this when the target model is the same one that is found in the machine on which the compilation takes place. But even in a cluster, the following options will still work appropriately, if the compilation is done in the context of a batch job on a targeted compute node:
Compilers | Flags to Target Host Architecture |
---|---|
Intel icc or ifort | -O2 -xHost |
GNU gcc or gfortran | -O2 -march=native -mtune=native |
If instead you want to "cross-compile" for a different target processor (e.g., if you are compiling a code on a cluster's login node in preparation for a batch job), you'll have to tell the compiler explicitly the exact type of processor you are targeting. For example, the recommended options for maximizing performance on Intel Skylake processors are:
Compilers | Flags to Target Skylake (SKX) |
---|---|
Intel icc or ifort | -O3 -xCORE-AVX512 -qopt-zmm-usage=high |
GNU gcc or gfortran | -O3 -march=skylake-avx512 -mprefer-vector-width=512 |
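As a rough illustration, the two tables above could be captured in a small helper for a build script. This is only a sketch: the function name arch_flags, the labels intel/gnu and host/skx, and the file name mycode.c are hypothetical placeholders, not part of any compiler or TACC tool.

```shell
# arch_flags FAMILY TARGET
#   FAMILY: intel | gnu   (hypothetical labels for this sketch)
#   TARGET: host  | skx   (host architecture, or Skylake cross-compile)
# Prints the optimization flags listed in the tables above.
arch_flags() {
    case "$1:$2" in
        intel:host) echo "-O2 -xHost" ;;
        gnu:host)   echo "-O2 -march=native -mtune=native" ;;
        intel:skx)  echo "-O3 -xCORE-AVX512 -qopt-zmm-usage=high" ;;
        gnu:skx)    echo "-O3 -march=skylake-avx512 -mprefer-vector-width=512" ;;
        *)          echo "arch_flags: unknown combination '$1' '$2'" >&2
                    return 1 ;;
    esac
}

# Show the command line one might use on a login node for a Skylake job.
# mycode.c is a placeholder source file name.
echo "icc $(arch_flags intel skx) -o mycode mycode.c"
```

Keeping the flags in one place like this makes it harder to mix up the host-targeted and cross-compiled variants when the same script is run on both login and compute nodes.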
Intel's -xCORE-AVX512 option, as well as GCC's -march=skylake-avx512 option, should be valid for any Intel Xeon Scalable Processor model. GCC's -march option includes a specific model name, so you can set it to cascadelake, icelake-server, sapphirerapids, etc., as appropriate. (The model-specific -march options are generally recognized by Intel, too.)
The option -qopt-zmm-usage=high exists only for Intel compiler versions 17.0.5 and up. Likewise, the latest architecture options might not exist in the base version of GCC that comes pre-installed with your Linux OS. On TACC resources, try module spider intel or module spider gcc to see which compiler versions are currently available, then run a module load or module swap command that gives you a compiler version up-to-date enough to cover the instruction set you need.
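One way to find out whether the gcc currently on your path knows a given architecture name is simply to ask it. The sketch below is an assumption-laden probe, not an official tool: the helper name flag_status is hypothetical, and it assumes a GCC-style driver that reads C source from standard input and exits with a nonzero status when it rejects an option.

```shell
# flag_status COMPILER FLAG...
# Prints "supported" if COMPILER accepts the flags, else "unsupported".
# Assumes a GCC-style driver that reads C from stdin via "-x c -" and
# returns a nonzero exit status for an option value it does not know.
flag_status() {
    compiler=$1
    shift
    if command -v "$compiler" >/dev/null 2>&1 &&
       echo 'int main(void) { return 0; }' |
           "$compiler" "$@" -x c -fsyntax-only - >/dev/null 2>&1
    then
        echo supported
    else
        echo unsupported
    fi
}

# Probe the architecture names mentioned above with whichever gcc is loaded.
for arch in skylake-avx512 cascadelake icelake-server sapphirerapids; do
    echo "gcc -march=$arch: $(flag_status gcc "-march=$arch")"
done
```

If a name comes back as unsupported, that is a hint to load a newer compiler module before building. Compilers that only warn about unrecognized options would defeat this probe, so treat the result as a quick check rather than a guarantee.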
But what if you'd like to have a unified executable that can be run on multiple different architectures, without recompiling? One way would be to tell the compiler to restrict itself to a basic instruction set that is common to all the target processors. The Intel and GCC compilers can be told to do this. But the Intel compilers (only) can also take advantage of a feature called "automatic CPU dispatch" that lets a single executable run optimally on more than one kind of processor. To build this feature into your code, specify a base architecture that will work on all the target processors, then add a list of the specialized architectures. For example, on TACC Frontera, which has both Cascade Lake (AVX-512) and Broadwell (AVX2) processor types, a unified code could be produced in these ways:
Compilers | Unifying Technique | Flags for One Combined AVX2/-512 Executable |
---|---|---|
Intel icc or ifort | Automatic CPU dispatch | -O3 -xCORE-AVX2 -axCORE-AVX512 |
Intel icc or ifort | Common AVX2 instructions | -O3 -xCORE-AVX2 |
GNU gcc or gfortran | Common AVX2 instructions | -O3 -march=broadwell |
The not-as-capable "common" options are included above because automatic CPU dispatch may not work in all cases. Furthermore, even in the multi-architecture compilation, -qopt-zmm-usage=high is left out because the base architecture (AVX2) does not recognize or support it.
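When testing a combined executable, it can help to confirm which instruction set a given node actually offers. The following sketch classifies a node from the flags line of /proc/cpuinfo on Linux. It is not how Intel's automatic CPU dispatch is implemented; the helper name best_isa is hypothetical, and checking only avx512f is a simplification of what -xCORE-AVX512 requires.

```shell
# best_isa "FLAG FLAG ..."
# Classifies a space-separated list of CPU feature flags as
# avx512, avx2, or baseline. Only avx512f is checked for AVX-512,
# which is a simplification for this sketch.
best_isa() {
    flags=" $1 "          # pad so every flag is surrounded by spaces
    case "$flags" in
        *" avx512f "*) echo avx512 ;;
        *" avx2 "*)    echo avx2 ;;
        *)             echo baseline ;;
    esac
}

# On Linux x86, the first "flags" line of /proc/cpuinfo lists the features.
# Elsewhere the file or the line may be missing, giving "baseline".
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null || true)
echo "this node: $(best_isa "$cpu_flags")"
```

Running this inside batch jobs on each node type shows whether the specialized AVX-512 path or the AVX2 base path is the one the node is eligible for.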
For the rest of our discussion, we will neglect GCC to concentrate on icc and ifort (for C/C++ and Fortran respectively), Intel's well-known optimizing compilers for Intel-compatible processors. Again, Intel classic compilers are often preferred at TACC because they consistently provide better performance than the GNU suite for C/C++ and Fortran codes on Intel processors. However, there are many workflows that are directly dependent upon gcc and g++. For these workflows, please use gcc, but use the one loaded by a gcc module rather than the default system gcc. The gcc module will provide you with a more up-to-date and performant version of the compiler. The relevant command is
$ module swap intel gcc
Generally, gcc options also work with icc, because Intel has made an effort to maintain compatibility with the most widely used compiler in scientific computing. Other compilers may have different flags for the same options, or different features altogether. Consult your compiler's manual (either via man <compiler_name>, or via its --help option) for details.
Some of the more important switches for the Intel compilers are listed below, and additional guidance can be found in the Frontera User Guide.
Intel Compiler Option | Description |
---|---|
-xHost | Generates code specialized to the host processor; enables the highest level of vectorization and other features supported on the processor on which you compile. It's an easy way to do -x<architecture_code> for the host's CPU type. (Not always the best choice on clusters, due to the different processor types found on compute nodes and login nodes.) |
-ip | Enables inter-procedural optimizations within files, while keeping track of original line numbers for debugging. |
-ipo | Produces optimizations which combine code in different files. May lead to longer compilation times. |
-qopt-prefetch[=n] | Enables various levels of data prefetching. Level n can be from 0 to 4; -qopt-prefetch=3 is included in -O2. |
-assume buffered_io | Ensures buffered I/O from your Fortran executables (recommended on TACC systems) |
-static | Create a static executable (i.e., do not link with shared-object libraries). |
-no-prec-div | Enables optimizations that give slightly less precise results than full IEEE division. |
-fp-model fast[=1|2] | Requests more aggressive optimizations for floating-point math. |
-fast | Means the same as -O3 -xHost -ipo -static -no-prec-div -fp-model fast=2. Note, this option sets "-static" which is incompatible with MPI on TACC systems, as the MPI libraries are all dynamic. See also the above note about -xHost. |
-qopenmp | Enables parallelizer to generate multithreaded code based on the OpenMP directives. |
-diag-type=diag-list (e.g., -diag-enable=vec) | Displays various user-controlled diagnostic information from static analysis, including vector diagnostic reporting (so you can be sure your innermost loops are vectorized!), which loops have successfully auto-parallelized, OpenMP messages, and many others. |
-g | Generates a symbol table and compiles for debugging, if you want to debug with gdb or see which line in the source code caused a crash. While -g is compatible with -O2, use -O0 if you want a more accurate backtrace. |
-check..., -check-pointers... | Builds in various run-time and compile-time checks that are useful for debugging and safety-critical code. C and Fortran selections differ. Options like these should likely be removed for production HPC work due to overhead in run-time checks. |
-fpe0 | Enables floating point exceptions at run time for Fortran codes. Can be useful for debugging. |
-mp1 | Improves floating-point precision and consistency, at a small cost to speed. |
-strict_ansi | Enforces strict compliance with the ANSI language standard for C codes. This might take away some optimization tricks, so code may run slower. |
The only way to know for sure whether options like -O3, -qopt-prefetch=4, and -fp-model fast=2 are actually improving your performance is to run the experiments and find out. Results will generally be application-dependent.
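Such an experiment might be scripted along the following lines. This is a minimal sketch under stated assumptions: bench.c stands for your own benchmark source, bench_trial is a scratch executable name, and elapsed is a hypothetical helper with only one-second resolution, so the benchmark should run for long enough to make the differences visible.

```shell
# Hypothetical names: CC is your compiler, SRC your benchmark source file.
CC=${CC:-icc}
SRC=${SRC:-bench.c}

# elapsed COMMAND [ARGS...]
# Prints the wall-clock seconds the command takes (1 s resolution).
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

if command -v "$CC" >/dev/null 2>&1 && [ -f "$SRC" ]; then
    for flags in "-O2" "-O3" "-O3 -qopt-prefetch=4" "-O3 -fp-model fast=2"; do
        # $flags is deliberately unquoted so it splits into separate options.
        if $CC $flags -o bench_trial "$SRC"; then
            echo "$flags: $(elapsed ./bench_trial) s"
        else
            echo "$flags: build failed"
        fi
    done
    rm -f bench_trial
else
    echo "skipping: need compiler '$CC' and source file '$SRC'"
fi
```

Repeating each run several times, on the same kind of compute node that the production job will use, gives a fairer comparison than a single timing.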