These exercises will introduce you to using OpenMP for parallel programming. While they are designed for Frontera and Vista at TACC, the files and instructions can be easily modified to work on other systems.

Quick links to the exercises below:
  1. Setup
  2. OMP Hello World
  3. Worksharing Loop
  4. OMP Functions
  5. Hand-Coding vs. Performance Library (MKL or NVPL)

Setup

To begin on Frontera or Vista, log onto a login node, then use the following Linux commands to fetch and extract the archive:
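The exact fetch command depends on where the archive is hosted for your course; as a sketch, once lab_openmp.tar (linked below under "Code for exercises") has been downloaded or copied into your working directory, the remaining steps would look like this (the name of the extracted directory is an assumption):

  tar xvf lab_openmp.tar
  cd lab_openmp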

The makefiles that come with these exercises are set up to use the Intel, NVIDIA, or GNU compilers. They also require either the Intel Math Kernel Library (MKL) on Frontera or the NVIDIA Performance Libraries (NVPL) on Vista.

  • On Frontera, an Intel compiler module is loaded by default; on Vista, an NVIDIA compiler module is loaded by default.
  • Frontera does not have NVIDIA compilers, nor does Vista have Intel compilers.
  • If you would like to use GNU's GCC compilers instead, simply run module load gcc on either system. Even though GCC is already installed with the Linux OS on each system, best practice is to load a GCC module explicitly, in order to get a more recent GCC version.
  • On Frontera, any Intel module automatically gives access to MKL. If a GCC module is loaded instead, then module load mkl must be run in order to obtain access to MKL.
  • On Vista, an NVPL module is always loaded by default. The version of NVPL will change appropriately if a different compiler module is loaded, including any GCC module.

The following exercises are very short, but some of them involve performance measurements, so they are best done on a dedicated node obtained through the batch system. You can do this by writing short batch scripts of your own, or by starting an interactive job with "idev" like this:
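For example, the following illustrative idev command requests one node for 60 minutes in Frontera's development queue; queue names and limits differ between Frontera and Vista, so check each system's documentation for the appropriate values:

  idev -p development -N 1 -m 60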

Code for exercises:

If you are working on a different system, you can download all the files at once, as a single tar archive: lab_openmp.tar. Alternatively, you can download the files individually in the list below.

Here are the individual programs, along with their makefiles:

To build and run the files, you will be using the make command. By default, make reads makefile or Makefile in the current directory. Therefore, you can select one of the above makefiles by renaming it to Makefile (capitalization is preferred, as it puts the file at the top of your directory listing). Alternatively, you can use make -f to specify which makefile should be read.

Below, assume one of the makefiles is renamed to Makefile.
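For instance, assuming one of the provided makefiles is named Makefile_intel (a hypothetical name; use whatever names appear in your download), and that hello_c is one of its build targets, either approach works:

  mv Makefile_intel Makefile
  make hello_c

  make -f Makefile_intel hello_c    # without renaming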


OMP Hello World

Look at the code in hello.c and/or hello.f90. This code simply reports OpenMP thread IDs in a parallel region. Compile hello.c or hello.f90 using the makefiles provided, and execute it first with 3 threads, then with 1 to 16 threads. (If you want Fortran, substitute hello_f90 for hello_c below.)
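Here is a minimal C sketch of this kind of program, using the standard OpenMP runtime functions; it is meant only to illustrate the idea and is not the exact contents of hello.c:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      #pragma omp parallel
      {
          int tid = omp_get_thread_num();        /* this thread's ID              */
          int nthreads = omp_get_num_threads();  /* number of threads in the team */
          printf("Hello from thread %d of %d\n", tid, nthreads);
      }
      return 0;
  }

To set the thread count at run time, set OMP_NUM_THREADS in your shell before launching the executable (the executable name hello_c is assumed here):

  export OMP_NUM_THREADS=3
  ./hello_c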


Worksharing Loop

Look at the code in the file daxpy.f90. The nested loop performs a simple DAXPY type of operation (double-precision a*x + y, scalar times vector plus vector), repeating it ten times in order to gather statistics on performance. The parameter N determines the length of the vectors; N=48*1024*1024 is the default. A more detailed comparison will be done in the batch job. (If you like, the makefile lets you "make run_daxpy" interactively to try different numbers of threads.)
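The essential pattern is an OpenMP worksharing loop over the vectors, wrapped in an outer loop that repeats the operation for timing. As a rough C illustration only (the exercise file itself is Fortran, and its details differ):

  #include <stdlib.h>

  #define N     (48*1024*1024)
  #define NREPS 10

  int main(void) {
      double a = 2.0;
      double *x = malloc(N * sizeof(double));
      double *y = malloc(N * sizeof(double));
      for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 0.0; }

      for (int rep = 0; rep < NREPS; rep++) {  /* repeat to gather timing statistics  */
          #pragma omp parallel for             /* worksharing: iterations are divided */
          for (long i = 0; i < N; i++)         /*   among the available threads       */
              y[i] = a * x[i] + y[i];          /* DAXPY: y = a*x + y                  */
      }

      free(x); free(y);
      return 0;
  }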


OMP Functions

Look at the code in work.f90. Threads perform some work in a subroutine called pwork, and the timer returns wall-clock time. Compile work.f90 and run it with one thread count to verify that it built properly. Then run it with other numbers of threads to see how much speedup it achieves.
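The runtime functions involved are standard OpenMP; the following C sketch shows how omp_get_wtime and omp_get_thread_num are typically used for this kind of measurement (the pwork routine below is just a stand-in, and work.f90 organizes its work differently):

  #include <stdio.h>
  #include <omp.h>

  /* stand-in for pwork: an arbitrary chunk of CPU-bound work */
  double pwork(int tid) {
      double s = 0.0;
      for (long i = 0; i < 50000000L; i++)
          s += 1.0 / (double)(i + tid + 1);
      return s;
  }

  int main(void) {
      double total = 0.0;
      double t0 = omp_get_wtime();            /* wall-clock time, in seconds  */
      #pragma omp parallel reduction(+:total)
      {
          int tid = omp_get_thread_num();     /* this thread's ID             */
          total += pwork(tid);                /* each thread calls the worker */
      }
      double t1 = omp_get_wtime();
      printf("max threads = %d, total = %g, elapsed = %.3f s\n",
             omp_get_max_threads(), total, t1 - t0);
      return 0;
  }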

Now look at work_serial.f90. We no longer use omp_lib, and numeric values are substituted for the calls to OMP_ functions. The OpenMP directives are ignored because the code is not compiled with OpenMP. As expected, this code runs with nearly the same speed as the work.f90 code with 1 thread. But since the computation time is relatively short, overhead due to OpenMP is appreciable with work.f90, even though all threads are forked at the beginning and the parallel region contains all the work.
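As a C analogue of that substitution (the actual file is Fortran): the runtime calls become constants, and the directives have no effect because the code is compiled without the compiler's OpenMP flag:

  #include <stdio.h>

  int main(void) {
      int tid = 0;       /* constant substituted for omp_get_thread_num()  */
      int nthreads = 1;  /* constant substituted for omp_get_num_threads() */
      /* without the OpenMP compile flag, !$OMP lines in Fortran (or
         #pragma omp lines in C) are simply ignored by the compiler   */
      printf("running as thread %d of %d\n", tid, nthreads);
      return 0;
  }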


Hand-coding vs. Performance Library

Look at the code in the file daxpy2.f90. In each iteration of the outer loop, the DAXPY operation is done by calling DAXPY itself, a standard BLAS routine that is implemented in various high-performance libraries, including Intel MKL and NVIDIA NVPL. It turns out that both of these libraries are already parallelized with OpenMP(!), so all you have to do is change the value of OMP_NUM_THREADS.
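As a hedged C illustration of the same idea through the CBLAS interface (the exercise file is Fortran, and the header name and link flags depend on whether you build against MKL or NVPL):

  #include <stdlib.h>
  #include <cblas.h>   /* header name varies by library, e.g. mkl.h for MKL */

  int main(void) {
      const int n = 48*1024*1024;
      const double a = 2.0;
      double *x = malloc(n * sizeof(double));
      double *y = malloc(n * sizeof(double));
      for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 0.0; }

      /* y = a*x + y, computed by the library; its internal OpenMP
         threading is controlled by OMP_NUM_THREADS                 */
      cblas_daxpy(n, a, x, 1, y, 1);

      free(x); free(y);
      return 0;
  }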

To make plots of the data from run_daxpy2 and run_daxpy, exit from your dedicated, interactive node and do the following quick run on a login node:

The plots will help you compare the performance of the library's DAXPY (MKL on Frontera, NVPL on Vista) with that of the hand-coded OpenMP version of DAXPY that you ran earlier, for varying numbers of threads. (If you get an "unable to open display" message, try logging out and logging in again, and be sure to give the -X option to ssh.)
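For example, the login would look something like this (Frontera's hostname is shown; substitute Vista's hostname and your own username as appropriate):

  ssh -X my_username@frontera.tacc.utexas.edu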

Note that the number of OpenMP threads is allowed to exceed the number of physical cores, even though this might reduce or eliminate any further performance gains.

 
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)