We turn now to options that go beyond the standard "-O" selections. They enable the compiler to do tricks like using specialized hardware instructions that can offer significant performance gains.

By adding just an extra flag or two, you can let the compiler know it should try to optimize the code for a particular model of processor. It is especially easy to do this when the target model is the same one that is found in the machine on which the compilation takes place. But even in a cluster, the following options will still work appropriately, if the compilation is done in the context of a batch job on a targeted compute node:

Compiler options that optimize a code for the processor type on the machine where it is being compiled.

| Compilers | Flags to Target Host Architecture |
|---|---|
| Intel icc or ifort | -O2 -xHost |
| GNU gcc or gfortran | -O2 -march=native -mtune=native |
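As a minimal sketch of how these flags might be wired into a build script, the function below returns the host-targeting flags from the table above for a given compiler name. The function name host_flags and the file name mycode.c are hypothetical, and the script only prints the compile line rather than running it.

```shell
# Sketch: look up the host-architecture flags listed in the table above.
# host_flags and mycode.c are illustrative names, not part of any toolchain.
host_flags() {
    case "$1" in
        icc|ifort)     echo "-O2 -xHost" ;;
        gcc|gfortran)  echo "-O2 -march=native -mtune=native" ;;
        *)             echo "unknown compiler: $1" >&2; return 1 ;;
    esac
}

# Print (rather than execute) each compile line so it can be reviewed first.
# The real command belongs inside a batch job on the targeted compute node.
for compiler in icc gcc; do
    echo "$compiler $(host_flags "$compiler") -o mycode mycode.c"
done
```

Printing the command first makes it easy to confirm the flags before submitting the batch job that performs the actual compilation.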

If instead you want to "cross-compile" for a different target processor (e.g., if you are compiling a code on a cluster's login node in preparation for a batch job), you'll have to tell the compiler explicitly which type of processor you are targeting. For example, the recommended options for maximizing performance on Intel Skylake processors are:

Compiler options that optimize a code for Intel Skylake (SKX) processors.

| Compilers | Flags to Target Skylake (SKX) |
|---|---|
| Intel icc or ifort | -O3 -xCORE-AVX512 -qopt-zmm-usage=high |
| GNU gcc or gfortran | -O3 -march=skylake-avx512 -mprefer-vector-width=512 |
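One way to see whether a login node and a compute node really differ is to inspect the CPU feature flags that Linux reports. The sketch below assumes a Linux system that exposes /proc/cpuinfo with an x86-style "flags" line; the helper name has_cpu_flag is hypothetical. The names avx2 and avx512f are the ones Linux uses for the AVX2 and AVX-512 Foundation instruction sets.

```shell
# has_cpu_flag FLAG [FILE]: check the first "flags" line of FILE for FLAG.
# FILE defaults to /proc/cpuinfo (an assumption: Linux on x86).
# Returns 0 if present, 1 if absent, 2 if the file cannot be read.
has_cpu_flag() {
    cpuinfo_file="${2:-/proc/cpuinfo}"
    [ -r "$cpuinfo_file" ] || return 2
    grep -m1 '^flags' "$cpuinfo_file" | grep -qw -- "$1"
}

for flag in avx2 avx512f; do
    if has_cpu_flag "$flag"; then
        echo "$flag: supported on this node"
    else
        echo "$flag: not reported on this node"
    fi
done
```

Running the same check on a login node and inside a batch job shows whether the two node types support the same instruction sets, and therefore whether explicit target flags are needed.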

Intel's -xCORE-AVX512 option, as well as GCC's -march=skylake-avx512 option, should be valid for any Intel Xeon Scalable Processor model. GCC's -march option includes a specific model name, so you can set it to cascadelake, icelake-server, sapphirerapids, etc., as appropriate. (The model-specific -march options are generally recognized by Intel, too.)

The option -qopt-zmm-usage=high exists only for Intel compiler versions 17.0.5 and up. Likewise, the latest architecture options might not exist in the base version of GCC that comes pre-installed with your Linux OS. On TACC resources, try module spider intel or module spider gcc to see which compiler versions are currently available, then run a module load or module swap command that gives you a compiler version up-to-date enough to cover the instruction set you need.
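Before settling on an -march value, it can help to ask the currently loaded compiler whether it recognizes that name at all. The following sketch assumes gcc is on the PATH and relies on the compiler rejecting unknown -march values with an error; the helper name gcc_accepts_march is hypothetical.

```shell
# gcc_accepts_march NAME: succeed if the gcc on PATH accepts -march=NAME.
# An empty C input is checked with -fsyntax-only, so no file is written.
gcc_accepts_march() {
    command -v gcc >/dev/null 2>&1 || return 1
    gcc -march="$1" -fsyntax-only -x c /dev/null >/dev/null 2>&1
}

for arch in skylake-avx512 cascadelake icelake-server sapphirerapids; do
    if gcc_accepts_march "$arch"; then
        echo "$arch: accepted by gcc $(gcc -dumpversion)"
    else
        echo "$arch: not accepted; a newer gcc module may be needed"
    fi
done
```

If a name you need is rejected, that is a hint to load a more recent compiler module before building.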

But what if you'd like to have a unified executable that can be run on multiple different architectures, without recompiling? One way would be to tell the compiler to restrict itself to a basic instruction set that is common to all the target processors. The Intel and GCC compilers can be told to do this. But the Intel compilers (only) can also take advantage of a feature called "automatic CPU dispatch" that lets a single executable run optimally on more than one kind of processor. To build this feature into your code, specify a base architecture that will work on all the target processors, then add a list of the specialized architectures. For example, on TACC Frontera, which has both Cascade Lake (AVX-512) and Broadwell (AVX2) processor types, a unified code could be produced in these ways:

Compiler options allowing an executable to run on both Broadwell and Cascade Lake processor types.

| Compilers | Unifying Technique | Flags for One Combined AVX2/AVX-512 Executable |
|---|---|---|
| Intel icc or ifort | Automatic CPU dispatch | -O3 -xCORE-AVX2 -axCORE-AVX512 |
| Intel icc or ifort | Common AVX2 instructions | -O3 -xCORE-AVX2 |
| GNU gcc or gfortran | Common AVX2 instructions | -O3 -march=broadwell |

The less capable "common" options are included above because automatic CPU dispatch may not work in all cases. Note also that -qopt-zmm-usage=high is omitted even from the multi-architecture compilation, because the base architecture (AVX2) does not recognize or support it.
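The three build variants can be sketched as a small lookup, which also makes explicit that the table lists no dispatch option for the GNU compilers. The function name unified_flags, its arguments, and the file names are hypothetical; the commands are printed, not executed.

```shell
# unified_flags FAMILY TECHNIQUE: print the flags from the table above.
# FAMILY is "intel" or "gnu"; TECHNIQUE is "dispatch" or "common".
# The function name and its arguments are illustrative only.
unified_flags() {
    case "$1:$2" in
        intel:dispatch) echo "-O3 -xCORE-AVX2 -axCORE-AVX512" ;;
        intel:common)   echo "-O3 -xCORE-AVX2" ;;
        gnu:common)     echo "-O3 -march=broadwell" ;;
        *) echo "no flags listed for: $1 $2" >&2; return 1 ;;
    esac
}

# Try automatic CPU dispatch first; keep a common AVX2 build as a fallback.
echo "icc $(unified_flags intel dispatch) -o mycode mycode.c"
echo "icc $(unified_flags intel common) -o mycode_avx2 mycode.c"
```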

For the rest of our discussion, we will set GCC aside and concentrate on icc and ifort (for C/C++ and Fortran, respectively), Intel's well-known optimizing compilers for Intel-compatible processors. Again, the Intel classic compilers are often preferred at TACC because they consistently provide better performance than the GNU suite for C/C++ and Fortran codes on Intel processors. However, many workflows depend directly on gcc and g++. For these workflows, please use gcc, but use the one loaded by a gcc module rather than the default system gcc. The gcc module will provide you with a more up-to-date and better-performing version of the compiler. The relevant command is:


$ module swap intel gcc

Generally, gcc options also work with icc, because Intel has made an effort to maintain compatibility with the most widely used compiler in scientific computing. Other compilers may have different flags for the same options, or different features altogether. Consult your compiler's manual (either via man <compiler_name>, or via its --help option) for details.

Some of the more important switches for the Intel compilers are listed below, and additional guidance can be found in the Frontera User Guide.

Selected compiler options for the Intel classic compilers.

| Intel Compiler Option | Description |
|---|---|
| -xHost | Generates code specialized to the host processor; enables the highest level of vectorization and other features supported on the processor on which you compile. It's an easy way to do -x<architecture_code> for the host's CPU type. (Not always the best choice on clusters, due to the different processor types found on compute nodes and login nodes.) |
| -ip | Enables inter-procedural optimizations within files, while keeping track of original line numbers for debugging. |
| -ipo | Produces optimizations that combine code in different files. May lead to longer compilation times. |
| -qopt-prefetch[=n] | Enables various levels of data prefetching. Level n can be from 0 to 4; -qopt-prefetch=3 is included in -O2. |
| -assume buffered_io | Ensures buffered I/O from your Fortran executables (recommended on TACC systems). |
| -static | Creates a static executable (i.e., does not link with shared-object libraries). |
| -no-prec-div | Enables optimizations that give slightly less precise results than full IEEE division. |
| -fp-model fast[=n] | Requests more aggressive optimizations for floating-point math; n can be 1 or 2. |
| -fast | Means the same as -O3 -xHost -ipo -static -no-prec-div -fp-model fast=2. Note that this option sets -static, which is incompatible with MPI on TACC systems, as the MPI libraries are all dynamic. See also the above note about -xHost. |
| -qopenmp | Enables the parallelizer to generate multithreaded code based on OpenMP directives. |
| -diag-type=diag-list (e.g., -diag-enable=vec) | Displays various user-controlled diagnostic information from static analysis, including vector diagnostic reporting (so you can be sure your innermost loops are vectorized!), which loops have successfully auto-parallelized, OpenMP messages, and many others. |
| -g | Generates a symbol table and compiles for debugging, in case you want to debug with gdb or see which line in the source code caused a crash. While -g is compatible with -O2, use -O0 if you want a more accurate backtrace. |
| -check..., -check-pointers... | Builds in various run-time and compile-time checks that are useful for debugging and safety-critical code. C and Fortran selections differ. Options like these should likely be removed for production HPC work due to the overhead of run-time checks. |
| -fpe0 | Enables floating-point exceptions at run time for Fortran codes. Can be useful for debugging. |
| -mp1 | Improves floating-point precision and consistency, at a small cost to speed. |
| -strict_ansi | Enforces strict compliance with the ANSI language standard for C codes. This might take away some optimization tricks, so code may run slower. |
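Because -fast implies -static, one possible workaround for MPI codes is to spell out the remaining flags by hand. The sketch below takes the expansion of -fast exactly as listed in the table above and drops -static; mpicc and mycode.c are placeholders for whatever MPI compiler wrapper and source file you actually use, and the command is only printed.

```shell
# Flags that -fast stands for, as listed in the table above.
FAST_EQUIVALENT="-O3 -xHost -ipo -static -no-prec-div -fp-model fast=2"

# Remove -static so the executable can link against dynamic MPI libraries.
FAST_NO_STATIC=$(printf '%s\n' "$FAST_EQUIVALENT" | sed 's/ -static//')

# mpicc and mycode.c are placeholders for your MPI wrapper and source file.
echo "mpicc $FAST_NO_STATIC -o mycode mycode.c"
```

The earlier caution about -xHost still applies: if you compile on a login node whose processor differs from the compute nodes, an explicit -x<architecture_code> may be the better choice.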

The only way to know for sure whether options like -O3, -qopt-prefetch=4, and -fp-model fast=2 are actually improving your performance is to run the experiments and find out. Results will generally be application-dependent.
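Such an experiment can be as simple as building the same benchmark several times and timing each executable. The sketch below is one possible layout, not a prescribed method: bench.c stands for your own benchmark source, the function name run_experiment and the labels are hypothetical, and CC defaults to icc. Timing uses whole seconds from date, so the benchmark should run long enough for differences to show.

```shell
# Hypothetical names: bench.c is your own benchmark; CC defaults to icc.
CC="${CC:-icc}"
SRC="${SRC:-bench.c}"

# run_experiment LABEL FLAGS...: build SRC with FLAGS, run it once, report time.
run_experiment() {
    label="$1"; shift
    if ! command -v "$CC" >/dev/null 2>&1 || [ ! -f "$SRC" ]; then
        echo "skip $label: need $CC on PATH and $SRC in this directory"
        return 0
    fi
    "$CC" "$@" -o "bench_$label" "$SRC" || return 1
    start=$(date +%s)
    "./bench_$label" >/dev/null
    end=$(date +%s)
    echo "$label ($*): $((end - start)) s"
}

run_experiment o2       -O2
run_experiment o3       -O3
run_experiment o3_pf4   -O3 -qopt-prefetch=4
run_experiment o3_fast2 -O3 -fp-model fast=2
```

When comparing floating-point options such as -fp-model fast=2, it is worth checking the numerical results of each run as well as the timings.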

 
© Cornell University, Center for Advanced Computing