In this exercise, we will work with a hybrid program that makes multithreaded MPI calls from inside an OpenMP parallel section. The short program is a fully functioning version of an example presented earlier in this topic. Our objective is to explore the effects of KMP_AFFINITY on the placement of threads from each MPI process.
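
To get a feel for what the program does before downloading it, here is a minimal sketch of the pattern, written in C for brevity (the actual exercise code, mpimulti.f90, is in Fortran, and its details may differ). The essential point is that MPI is initialized with MPI_THREAD_MULTIPLE so that every OpenMP thread can make its own MPI call: threads on rank 0 each send a message, and threads on rank 1 each receive one.


#define _GNU_SOURCE                     /* needed for sched_getcpu() on Linux */
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* Request full thread support, since threads call MPI concurrently */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int cpu = sched_getcpu();           /* cpu this thread is running on */
        int msg = tid;

        if (rank == 0) {                    /* sending threads */
            MPI_Send(&msg, 1, MPI_INT, 1, tid, MPI_COMM_WORLD);
            printf("rank %d thread %2d on cpu %2d sent\n", rank, tid, cpu);
        } else if (rank == 1) {             /* receiving threads */
            MPI_Recv(&msg, 1, MPI_INT, 0, tid, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank %d thread %2d on cpu %2d received\n", rank, tid, cpu);
        }
    }

    MPI_Finalize();
    return 0;
}

The sketch assumes both ranks run with the same OMP_NUM_THREADS, so that each tagged send finds a matching receive.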

You may find it helpful to refer back to a previous page in this roadmap that explains the effects of the scatter and compact thread affinity patterns provided through the KMP_AFFINITY variable.

Preliminaries

You will need to download two codes. You can do this by copying the source text linked to the names of these codes, mpimulti.f90 and mycpu.c, or by issuing the following commands on a Frontera or Stampede3 login node:


wget https://cvw.cac.cornell.edu/hybrid-openmp-mpi/hybrid-program-types/mpimulti.f90
wget https://cvw.cac.cornell.edu/hybrid-openmp-mpi/hybrid-program-types/mycpu.c

The code mycpu.c merely provides a Fortran interface to the sched_getcpu() function, a C-only API that returns the index of the cpu upon which the thread is currently running. It is not necessary to compile mycpu.c with an MPI wrapper script, but it is convenient to do so anyway. Compile and link mycpu.c and mpimulti.f90 using the default Intel compilers and MPI implementation:


mpicc -c mycpu.c
mpif90 -qopenmp mpimulti.f90 mycpu.o -o mpimulti
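
For reference, the essential content of a wrapper like mycpu.c is only a few lines of C. The sketch below is a guess at what it might contain; the function name and the trailing underscore (which matches the default name mangling of many Fortran compilers) are assumptions, so rely on the downloaded file rather than this sketch.


#define _GNU_SOURCE              /* exposes sched_getcpu() in glibc */
#include <sched.h>

/* Hypothetical Fortran-callable wrapper; the real mycpu.c may differ */
int findmycpu_(void)
{
    return sched_getcpu();       /* index of the cpu running this thread */
}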

Specific instructions for Frontera are given first, followed by instructions for Stampede3.

Frontera Instructions

The mpimulti program takes such a trivial amount of time to run that it is fine to run it interactively on a Frontera development node, even with 56 total threads, or one thread for every core. Here is how to request the interactive job:


idev -A <###> -N 1 -n 2

In the above, <###> represents the name of your specific account or project on Frontera. The -N 1 -n 2 arguments mean that 1 node will be assigned to your job, and that at most 2 MPI processes can be started on that node when an ibrun command is issued. The -p development flag is implied as well, since the development queue is the default for the idev command.

Once you get a command prompt, use ibrun to launch a pair of mpimulti processes with 28 threads per process:


export OMP_NUM_THREADS=28
ibrun -n 2 ./mpimulti | sort -f

As shown, it's advisable to pipe the output of the program through sort -f to make it easier to spot patterns. First, notice that some cpu values may occur more than once in the output. This is because on CLX (the Cascade Lake processors in Frontera nodes), threads are just scheduled on the least loaded cpus on the node, even if some other threads have been assigned there already.

If you look a bit more carefully, you will see that all sending threads (on rank 0) are located on even-numbered cpus, while all receiving threads (on rank 1) are on odd-numbered cpus. This pattern makes sense if you assume that (a) each rank is given affinity to one of the two sockets, and (b) cpus 0, 1, 2, 3,... are located on alternating sockets.

To verify this numbering scheme of the cpus, we can scan the file /proc/cpuinfo, taking note of the physical id (socket) for each of the 56 cpus (which are unfortunately labeled as "processors" in /proc/cpuinfo):


cat /proc/cpuinfo | grep "processor\|physical id" | less

Our hypothesis is confirmed. We're now ready to experiment with thread affinity. An easy way to do this is to try different settings of the KMP_AFFINITY environment variable. Let's look at how the above pattern of thread assignments—which is somewhat arbitrary, apart from socket number—is affected when KMP_AFFINITY is set:


export KMP_AFFINITY=scatter
ibrun -n 2 ./mpimulti | sort -f
export KMP_AFFINITY=compact
ibrun -n 2 ./mpimulti | sort -f

Evidently, both patterns are effective at tying each thread to a different cpu, as no cpu id appears twice in the sorted lists. Surprisingly, though, the pattern of the thread assignments looks the same in the two cases. This is because hyperthreading is disabled on Frontera, which means that a "cpu" is equivalent to a core. Hyperthreading would make it appear that there are two virtual cpus per core, with the result that the "compact" pattern would try to assign pairs of threads to cores, as the Stampede3 exercise (below) makes clear.

At first glance, the numerical assignment of threads to cores looks as arbitrary as it did prior to setting any affinity. But an interesting motif emerges if you look at the sorted output in groups of 7 lines.

Stampede3 Instructions

Let's gain interactive access to a Stampede3 Skylake node and run our tests there, using 2 MPI tasks for the 48 cores on the node:


idev -p skx-dev -A <###> -N 1 -n 2

Be sure to use the -p skx-dev argument as shown to specify the appropriate queue for the Skylake development nodes. When you get a command prompt, follow these steps:


export OMP_NUM_THREADS=24
ibrun -n 2 ./mpimulti | sort -f

Notice that some cpu values may occur more than once in the output. This is because on SKX, threads are just scheduled on the least loaded cpus on the node, even if some other threads have been assigned there already.

If you look a bit more carefully, you will see that all sending threads (on rank 0) are located on even-numbered cpus, while all receiving threads (on rank 1) are on odd-numbered cpus. This pattern makes sense if you assume that (a) each rank is given affinity to one of the two sockets, and (b) cpus 0, 1, 2, 3,... are located on alternating sockets.

To verify this numbering scheme of the cpus, we can scan the file /proc/cpuinfo, taking note of the physical id (socket) for each of the 2*48=96 cpus. Why are there 96? Hyperthreading is enabled on the Stampede3 SKX nodes, which means that every physical core provides two "hardware threads", each of which is counted as a cpu. Even more confusingly, these "cpus" are labeled as "processors" in /proc/cpuinfo. So this is the command we want:


cat /proc/cpuinfo | grep "processor\|physical id" | less

Our hypothesis is confirmed. Let's now look at how the above thread pattern—which is somewhat arbitrary, apart from socket number—is affected when KMP_AFFINITY is set:


export KMP_AFFINITY=scatter
ibrun -n 2 ./mpimulti | sort -f
export KMP_AFFINITY=compact
ibrun -n 2 ./mpimulti | sort -f

The "scatter" pattern clearly ties each thread to a different core, as no cpu id appears twice in the sorted lists. In contrast, the "compact" pattern places two threads on every occupied core. We know this to be true because the output includes cpus with high indexes (48-95), corresponding to the second of the two hardware threads on the SKX cores. Furthermore, if you look closely at the output, you will see that a high-numbered cpu always differs by 48 from one of the low-numbered ones. (The precise mapping of threads to cores is not obvious in either case, but a pattern emerges if you look at groups of 6.)

Thus, when OMP_NUM_THREADS=24 and KMP_AFFINITY=compact, the OpenMP threads are occupying the hardware threads on just a subset of the physical cores. In other words, "compact" on a Stampede3 SKX node means that 2 OpenMP threads are packed onto the minimum number of physical cores (12 for each MPI rank, in this case). You can confirm this by subtracting 48 from any high-numbered cpu in the output.
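
If you want further confirmation of how KMP_AFFINITY has pinned the threads, one option (not part of the exercise code) is to have each OpenMP thread print the set of cpus it is allowed to run on, using the Linux sched_getaffinity() call. Below is a small standalone C sketch along these lines; compile it with an OpenMP flag such as -qopenmp (Intel) or -fopenmp (GCC) and run it under the same KMP_AFFINITY settings you used above.


#define _GNU_SOURCE              /* exposes sched_getaffinity() and CPU_* macros */
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);   /* pid 0 = the calling thread */

        /* Build a space-separated list of the cpu ids in this thread's mask */
        char buf[1024] = "";
        int pos = 0;
        for (int cpu = 0; cpu < CPU_SETSIZE && pos < (int)sizeof(buf) - 8; cpu++)
            if (CPU_ISSET(cpu, &mask))
                pos += snprintf(buf + pos, sizeof(buf) - pos, "%d ", cpu);

        printf("thread %2d may run on cpus: %s\n", omp_get_thread_num(), buf);
    }
    return 0;
}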

What will happen if you set OMP_NUM_THREADS equal to 96, the total number of hardware threads, rather than the total cores? Try it and find out!

 