This exercise will demonstrate how compiler options help you optimize the performance of hand-coded routines, and how performance can be improved by calling numerical libraries.

You will be working with three different versions of a code to solve a system of linear equations via LU factorization. The ludecomp repository contains all three codes, together with a makefile to compile them and a script to submit them to the scheduler. Embedded in the script are instructions that time each code and put the timing data into an output file.

Make sure you have a relatively modern version of Git installed. On Frontera, a modern version can be used by loading an environment module (this module is generally loaded by default):

$ module load git

Now get the source code with:


$ cd ~
$ git clone https://github.com/cornellcac/ludecomp.git

You will find a directory called ludecomp. In it are these three code versions:

  • nr.c - uses code adapted from the book Numerical Recipes. It requires no external libraries.
  • gsl.c - calls the GNU Scientific Library (GSL). In the TACC environment, you need to load this library's module so that it can be linked and loaded.
  • lapack.c - calls the standard interface to LAPACK. This call may be linked against any compatible, optimized library that performs linear algebra. We will be using Intel's Math Kernel Library (MKL), which comes with the Intel compilers; the needed modules are loaded by default on Frontera.

The Makefile has many targets. Here are the two you are likely to use most often:

  • make - This builds all three versions of the program: nr, gsl, and lapack. If make fails, the likely cause is that a required library's module is not currently loaded.
  • make clean - This deletes all the binaries you compiled, giving you a clean start after a significant change to your build procedure or source code.

To get started on Frontera, here are the steps to follow:

  1. Add GSL to the currently-loaded modules: module load gsl
  2. In the directory ~/ludecomp, type: make
  3. Submit the job.sh script: sbatch job.sh, or if you want to designate an account different than your default Frontera account, sbatch -A <account name> job.sh
  4. View the results of your runs: less results.txt
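For reference, a Slurm batch script for a job like this generally has the shape sketched below. The directive values and the timing loop are illustrative guesses, not the actual contents of job.sh; use the script shipped in the repository.

```shell
#!/bin/bash
#SBATCH -J ludecomp          # job name (illustrative)
#SBATCH -o results.txt       # file where Slurm collects the output
#SBATCH -p development       # queue/partition (illustrative)
#SBATCH -N 1                 # one node
#SBATCH -n 1                 # one task
#SBATCH -t 00:10:00          # ten-minute time limit (illustrative)

# Time each executable; the timings land in results.txt
for exe in nr gsl lapack; do
    echo "== $exe =="
    ( time ./$exe ) 2>&1
done
```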

Now take a minute to look at the codes and evaluate each based on these criteria:

  1. How many lines of code did it take to implement the solution? (Hint: run wc -l *.c)
  2. How fast does it run? Look at results.txt when it finishes.
  3. For each version, how hard would it be to swap in a different algorithm by, for instance, substituting an iterative solver, or using a sparse-matrix solver?
  4. Can any of these codes run multithreaded? Can they run distributed, using MPI? You may need to Google the libraries to figure this out.

We've already seen what the compiler can do with -g, the debugging option. Next let's try the optimization options, to assess how they affect running times.

  1. Edit the first few lines of the Makefile to add some compiler flags. Some possible CFLAGS are listed below (the codes are written in C, so the relevant variable is CFLAGS, not FFLAGS).
  2. Compile the three codes: make
  3. Submit the codes to the scheduler: sbatch -A <account name> job.sh
  4. Again examine results.txt.
  5. Try some other choices of compiler and optimizations and see what is fastest. For the codes that call libraries, how does your choice of options affect the performance? (Remember, you're not compiling the libraries!)
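As an illustration, the edit in step 1 might look like the fragment below. The variable names here (CC, CFLAGS) are assumptions; match whatever names the repository's Makefile actually uses.

```makefile
# Illustrative Makefile fragment -- adapt to the variables the real Makefile defines
CC     = icc
CFLAGS = -O3 -xCORE-AVX512 -qopt-zmm-usage=high
```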

Here are some compiler options to try:

  • -O3 - Tries harder to optimize code compared to -O2. Typically the code will be marginally faster, but not always, and the output may no longer be correct (remember to check).
  • -ipo - Looks for inter-procedural optimizations.
  • -xCORE-AVX512 -qopt-zmm-usage=high - Compiles an executable suited for heavy matrix computations on AVX-512 processors.
  • -g - Produces better debugging information.
  • -fp-stack-check - Generates extra code after every function call to catch certain errors in the floating point stack, at the cost of speed.
  • -qopenmp - Multithreads the code according to OpenMP directives, if any are present in the source code.

Here is one option not to try, as it does not make proper use of Frontera's resources:

  • -fast - An abbreviation for: -O3 -xHost -ipo -static -no-prec-div -fp-model fast=2. TACC discourages the use of the -fast option on Frontera, because -static means that every process in a parallel job contains its own copy of any instructions linked from libraries, with no code sharing. (Feel free to try the other options without -static.)

Extra credit: The MKL library has built-in support for OpenMP multithreading. To enable it, you simply set the following environment variable:


$ export OMP_NUM_THREADS=56

Try this! Uncomment the above line in job.sh and submit the job. Does the time improve compared to leaving this variable unset? Note that the Frontera User Guide says you can safely set this variable to the maximum number of hardware threads in a node, as MKL automatically chooses an optimal thread count that may be less than this ceiling. What happens if you try a different number of threads (especially a small number)?

  • Remember, Frontera's CLX compute nodes have 2 processors with a total of 56 cores and 56 hardware threads (hyperthreading is not enabled).
  • On Frontera, the default value of OMP_NUM_THREADS is 1, so that when 56 MPI processes share a node, they won't create chaos by spawning 56xN threads.
  • Note that job.sh allows you to specify the size of the square matrix being factored (see export MATRIX=). You can experiment with this to see how matrix size affects the results that depend on OMP_NUM_THREADS.
 
©  Cornell University   |   Center for Advanced Computing   |   Copyright Statement   |   Inclusivity Statement