What is MKL? The Intel oneAPI Math Kernel Library (MKL) provides both Fortran and C interfaces. It includes BLAS levels 1-3, LAPACK, FFTs, the Vector Math Library (VML), and other routines. MKL is optimized by Intel for all current Intel architectures. As we'll see in the exercise, it can also exploit shared-memory parallelism via OpenMP if desired: just set OMP_NUM_THREADS or MKL_NUM_THREADS to a value greater than 1.
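As a minimal sketch of the thread settings just mentioned, the lines below set both variables before a run. The values 8 and 4 and the executable name myprog are hypothetical. When both variables are set, MKL routines follow MKL_NUM_THREADS, while the program's own OpenMP regions follow OMP_NUM_THREADS.

```shell
# Sketch only: the thread counts here are illustrative, not recommendations.
export OMP_NUM_THREADS=8   # threads for the program's own OpenMP regions
export MKL_NUM_THREADS=4   # threads for MKL calls; overrides OMP_NUM_THREADS inside MKL
echo "OpenMP threads: $OMP_NUM_THREADS, MKL threads: $MKL_NUM_THREADS"
# ./myprog                 # hypothetical executable linked against threaded MKL
```

Setting either variable to 1 makes the threaded MKL library run its routines serially.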

This library comes bundled with the Intel compilers. If you switch to a non-Intel compiler, you must re-load MKL explicitly:


$ module swap intel gcc
$ module load mkl
$ module help mkl

Assuming the correct environment module is loaded, you can compile and link with just an extra flag or two. The link line requires special attention when using the GNU compilers with MKL; see the Frontera User Guide or the Intel oneAPI Math Kernel Library Link Line Advisor website. The Intel compilers, however, already "know" where the MKL libraries and headers are located, so it is generally easy to compile and link against MKL, as shown in the examples below. (For final linking, the flags -lpthread -lm may also be needed.)


$ icc   myprog.c   -mkl   # this means the same as -mkl=parallel
$ ifort myprog.f90 -mkl

Use -mkl=sequential if you do not wish to link the OpenMP-multithreaded version of MKL. More detailed compiler and linker flags may also be required for static linking or other non-typical MKL configurations. If the exact locations of any MKL files are needed, they can be found under $MKLROOT. TACC also provides the special environment variables $TACC_MKL_INC and $TACC_MKL_LIB, which point to MKL's header and library directories.
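For the GNU compilers, the link line has to be spelled out by hand. The sketch below assembles one plausible gcc command for dynamically linked, sequential MKL with the LP64 interface, using the TACC variables when they are set. The fallback paths and the file name myprog.c are assumptions, and the command is only printed, not run. Check the result against the Link Line Advisor before relying on it.

```shell
# Sketch only: build (and print) a gcc command line for sequential MKL.
# Fallback paths are assumptions; newer oneAPI releases may use lib instead of lib/intel64.
MKL_INC="${TACC_MKL_INC:-${MKLROOT:-/opt/intel/oneapi/mkl/latest}/include}"
MKL_LIB="${TACC_MKL_LIB:-${MKLROOT:-/opt/intel/oneapi/mkl/latest}/lib/intel64}"

# Interface layer, threading layer (sequential), core library, then system libraries.
MKL_LINK="-L$MKL_LIB -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl"

# Print the command instead of running it, so it can be inspected first.
echo gcc myprog.c -I"$MKL_INC" $MKL_LINK -o myprog
```

For gfortran, the interface library would be -lmkl_gf_lp64 rather than -lmkl_intel_lp64.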

For MPI codes, you are likely to want to access the Intel compilers through mpicc or mpif90. The MKL flags are mostly unchanged, but have a look at the Frontera User Guide for the special case of ScaLAPACK. Note that -mkl=sequential may be the right choice for a code that relies on MPI, since each MPI task typically occupies its own core already.
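If a hybrid MPI code does link the threaded MKL, the thread count per task should be chosen so that tasks times threads does not exceed the cores on a node. The arithmetic can be sketched as follows; the 56 cores per node and 14 tasks per node are assumptions chosen for illustration.

```shell
# Sketch only: pick an MKL thread count that avoids oversubscribing a node.
CORES_PER_NODE=56    # assumption: cores available on one compute node
TASKS_PER_NODE=14    # hypothetical number of MPI tasks placed on each node
export MKL_NUM_THREADS=$(( CORES_PER_NODE / TASKS_PER_NODE ))
echo "MPI tasks per node: $TASKS_PER_NODE, MKL threads per task: $MKL_NUM_THREADS"
```

With one MPI task per core, the same arithmetic gives a single thread per task, which is the case where the sequential MKL is the natural fit.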

Another significant math library is the Fastest Fourier Transform in the West (FFTW), a comprehensive set of C routines for computing discrete Fourier transforms. It uses a so-called cache-oblivious algorithm to obtain excellent performance regardless of the platform on which it runs. The routines can transform single- and multi-dimensional real and complex data of arbitrary input size. All the right stuff is built in for you: the Cooley-Tukey algorithm, the Prime Factor algorithm, Rader's algorithm for prime sizes, the split-radix algorithm, and so on. On Frontera, one can either make use of pre-built FFTW 2 and 3 libraries from their own modules, or use MKL, which implements FFTW-compatible interfaces.
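The two FFTW options can be compared as build commands. This is a sketch under stated assumptions: myfft.c is a hypothetical source file that includes fftw3.h, the variable names TACC_FFTW3_INC and TACC_FFTW3_LIB are assumed to be set by the FFTW 3 module, and the /path/to placeholders stand in for real install locations. The commands are printed rather than executed.

```shell
# Sketch only: two ways to build the same FFTW3-based source file.

# 1) Against a stand-alone FFTW 3 library (paths are placeholders if the variables are unset).
FFTW_INC="${TACC_FFTW3_INC:-/path/to/fftw3/include}"
FFTW_LIB="${TACC_FFTW3_LIB:-/path/to/fftw3/lib}"
FFTW_CMD="icc myfft.c -I$FFTW_INC -L$FFTW_LIB -lfftw3 -lm -o myfft"

# 2) Against MKL's FFTW3-compatible interface; MKL keeps its fftw3.h under include/fftw.
MKL_CMD="icc myfft.c -I${MKLROOT:-/path/to/mkl}/include/fftw -mkl -o myfft"

echo "$FFTW_CMD"
echo "$MKL_CMD"
```

In the second case the source code is unchanged; only the include path and the link flags differ.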

 