These exercises will introduce you to using OpenMP for parallel programming.

Quick links to the exercises below:
  1. Setup
  2. OMP Hello World
  3. Worksharing Loop
  4. OMP Functions
  5. Hand-Coding vs. MKL

Setup

While these exercises were designed for Stampede2, they can easily be modified to work on other systems. To begin on Stampede2, log onto a login node, then use the following Linux commands to fetch and extract the archive:


ssh -X <user-name>@stampede2.tacc.utexas.edu
wget http://cvw.cac.cornell.edu/openmp/openmp-constructs/lab_openmp.tar
tar xvf lab_openmp.tar

The makefile that comes with these exercises is set up to use the Intel compilers and (for the final exercise) the Intel Math Kernel Library (MKL). Both are loaded by default on Stampede2.

The exercises are very short, but some of them involve performance measurements, so they are best done on a dedicated node through the batch system. You can do this by writing short batch scripts of your own (a sample follows the srun command below), or by starting an interactive job with "srun" like this:


srun -p skx-dev -t 0:30:00 -n 1 -N 1 --pty /bin/bash -l
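
If you choose the batch route instead, a minimal Slurm script along the following lines should work; the job name, thread count, and executable are only illustrative placeholders. Submit it with "sbatch".


#!/bin/bash
#SBATCH -J omp_lab          # job name (placeholder)
#SBATCH -p skx-dev          # same development queue as the srun example above
#SBATCH -N 1                # one node
#SBATCH -n 1                # one task; OpenMP parallelism comes from threads
#SBATCH -t 0:30:00          # 30-minute limit

export OMP_NUM_THREADS=3    # example thread count
./hello_c                   # or whichever exercise executable you are testing
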
Code for exercises:

If you are working on a different system, you can download all the files at once as a single tar archive, lab_openmp.tar, from the URL given above. Here are the individual programs it contains, along with their makefile:

  makefile
  hello.c and hello.f90
  daxpy.f90
  work.f90
  work_serial.f90
  daxpy2.f90


OMP Hello World

Look at the code in hello.c and/or hello.f90. This code simply reports OpenMP thread IDs in a parallel region. Compile hello.c or hello.f90 using the makefile provided, then execute first with 3 threads and then with 1 to 16 threads. (If you want Fortran, substitute hello_f90 for hello_c below.)


make hello_c             # build the executable
export OMP_NUM_THREADS=3
./hello_c                # run once with 3 threads
make run_hello_c         # run with 1 to 16 threads
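
For orientation, the heart of hello.c looks something like the following minimal sketch; the exact variable names and message wording in the distributed source may differ.


#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Fork a team of threads; each one reports its own ID. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();    /* this thread's ID */
        int nth = omp_get_num_threads();   /* size of the team */
        printf("Hello from thread %d of %d\n", tid, nth);
    }
    return 0;
}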

Worksharing Loop

Look at the code in file daxpy.f90. The nested loop performs a simple DAXPY type of operation (double-precision a*x+y, scalar times vector plus vector); it is repeated ten times in order to gather statistics on performance. Parameter N determines the size of the vectors: N=48*1024*1024 is the default. A more detailed comparison will be done in the batch job, and a C sketch of the worksharing pattern appears after the commands below. (If you like, the makefile lets you "make run_daxpy" interactively to try different numbers of threads.)


make daxpy
export OMP_NUM_THREADS=3
./daxpy
# the next step is needed for the final exercise
make run_daxpy
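
daxpy.f90 itself is Fortran, but its worksharing pattern translates directly to C. Here is a minimal, self-contained sketch under that assumption; the initial values and output format are illustrative.


#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const int n = 48*1024*1024;            /* default vector length N */
    const double a = 2.0;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    for (int rep = 0; rep < 10; rep++) {   /* repeat to gather statistics */
        double t0 = omp_get_wtime();
        /* worksharing loop: the iterations are divided among the threads */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
        printf("rep %d: %.4f s\n", rep, omp_get_wtime() - t0);
    }
    free(x);
    free(y);
    return 0;
}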

OMP Functions

Look at the code in work.f90. Threads perform some work in a subroutine called pwork; the timer returns wall-clock time. Compile work.f90 and run it with one thread count to verify that it built properly, then run it with other numbers of threads to see how much speedup it achieves.


make work
export OMP_NUM_THREADS=3
./work
# the next step is optional
make run_work
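
In C, the equivalent pattern of OpenMP run-time library calls looks roughly like the sketch below; the body of pwork here is just a stand-in for the real computation in work.f90.


#include <stdio.h>
#include <omp.h>

/* stand-in for the pwork subroutine: give each thread something to do */
void pwork(int tid) {
    volatile double s = 0.0;               /* volatile: keep the loop from being optimized away */
    for (int i = 0; i < 50000000; i++)
        s += (double)(i % (tid + 1));
}

int main(void) {
    double t0 = omp_get_wtime();           /* wall-clock timer */
    #pragma omp parallel
    {
        pwork(omp_get_thread_num());       /* each thread passes its own ID */
    }
    double t1 = omp_get_wtime();
    printf("%d threads: %.4f seconds\n", omp_get_max_threads(), t1 - t0);
    return 0;
}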

Now look at work_serial.f90. It no longer uses omp_lib; numeric values are substituted for the calls to the OMP_ functions, and the OpenMP directives are ignored because the code is not compiled with OpenMP. As expected, this code runs at nearly the same speed as work.f90 does with 1 thread. But since the computation time is relatively short, the overhead due to OpenMP is appreciable, even though all the threads are forked at the beginning and the parallel region contains all the work.


make work_serial
./work_serial
export OMP_NUM_THREADS=1
./work
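
In a C analogue, the same substitution could be automated with the standard _OPENMP preprocessor macro rather than edited by hand. A sketch of that idea follows; note that the clock()-based fallback is an assumption, and it measures CPU time, which matches wall time only for serial code.


#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* Serial build: substitute fixed values for the run-time routines,
   mirroring what work_serial.f90 does by hand. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
static double omp_get_wtime(void) {
    return (double)clock() / CLOCKS_PER_SEC;   /* CPU time ~ wall time when serial */
}
#endif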

Hand-Coding vs. MKL

Look at the code in file daxpy2.f90. The nested loop performs a DAXPY operation on each pass through the outer loop. This time, the DAXPY routine comes from Intel MKL, which is already parallelized with OpenMP (!). All you have to do is change the value of OMP_NUM_THREADS.


make daxpy2
export OMP_NUM_THREADS=3
./daxpy2
# the next step is needed to produce plots
make run_daxpy2
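
daxpy2.f90 calls MKL's Fortran DAXPY; a C program would go through the CBLAS interface instead. Here is a minimal sketch under that assumption; the initial values are illustrative, and with the Intel compilers you would link using the -mkl flag.


#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>                     /* MKL's C interface, including CBLAS */

int main(void) {
    const MKL_INT n = 48*1024*1024;
    const double a = 2.0;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (MKL_INT i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* y = a*x + y; threaded MKL parallelizes this internally and
       honors OMP_NUM_THREADS (or MKL_NUM_THREADS) */
    cblas_daxpy(n, a, x, 1, y, 1);

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}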

To make plots of the data from run_daxpy2 and run_daxpy, exit from your dedicated, interactive node and run the following on a login node:

make plot_daxpys

The plots will help you compare the performance of MKL's DAXPY with that of the hand-coded OpenMP version of DAXPY that you ran earlier, for varying numbers of threads. (If you get an "unable to open display" message, try logging out and logging in again, and be sure to give the -X option to ssh.)

Note that the number of OpenMP threads can exceed the number of physical cores.
