This exercise takes a brief look at a strategy for scaling out and up at the same time: OpenMP can be combined with MPI in a single code to produce a parallel application that may be suitable for a cluster of Xeon Scalable Processors (or similar multi-core chips). Such a code is termed a hybrid code. To learn more, see the OpenMP and Hybrid Programming with OpenMP and MPI tutorials elsewhere in CVW.
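
To make the idea concrete, here is a minimal sketch of a hybrid hello-world in C. It is not necessarily identical to the code used in this lab, but it illustrates the same pattern: each MPI rank opens an OpenMP parallel region, and every thread in the team reports its identity.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, nranks, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(host, &namelen);

    /* Each MPI rank spawns a team of OpenMP threads; the team size
       is determined by the OMP_NUM_THREADS environment variable. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d on rank %d of %d (host %s)\n",
               tid, nthreads, rank, nranks, host);
    }

    MPI_Finalize();
    return 0;
}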

The source for this lab is hello_hybrid.c. As in previous exercises, you may download the code via that link, or copy-paste it into a file on the target machine. The compilation command is very similar to ones we have used already; as you might expect, it merely combines elements of the earlier MPI and OpenMP commands:

$ mpicc -xCORE-AVX512 -O3 -qopenmp hello_hybrid.c -o hello_hybrid

Again, we run the code by using Slurm's sbatch command to submit a script to the development queue on Frontera. Alternatively, the commands in the following script can be entered interactively, skipping the Slurm options. In this instance, the interactive session on Frontera should be started with idev -N 2 -n 2 so that just 1 MPI process is running on each of 2 nodes.
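
A batch script along the following lines will do the job. This is a sketch: the time limit, the thread count of 2 for the first run, and the account name myproject are placeholders, so adjust them to suit your own situation.

#!/bin/bash
#SBATCH -N 2                  # 2 nodes
#SBATCH -n 2                  # 2 MPI tasks in total, i.e., 1 per node
#SBATCH -p development        # Frontera development queue
#SBATCH -t 00:05:00           # placeholder time limit
#SBATCH -A myproject          # placeholder: replace with your account

# First run: a small, explicitly chosen number of threads per rank
export OMP_NUM_THREADS=2
ibrun ./hello_hybrid

# Second run: reset OMP_NUM_THREADS in the shell on this node; the new
# value is honored by the MPI processes on both nodes
export OMP_NUM_THREADS=56
ibrun ./hello_hybrid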

Remember to edit the above batch script (which you can name run_hello_hybrid.sh, perhaps) to change the account number to one that you are allowed to use. Output from your job will be saved to a file named slurm-######.out. Notice that in the second run of the code, after we have reset OMP_NUM_THREADS to 56 in the shell environment on one node, the new value of that variable is honored by both MPI processes, even though they are running on separate nodes. If you wanted to run this example on Stampede3 instead, you could modify the script to specify 48 threads per node in the skx-dev queue, with an allocation account number appropriate for that machine.
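
For instance, the Stampede3 version of the script might differ only in lines like the following (again, the account name is a placeholder):

#SBATCH -p skx-dev            # Stampede3 SKX development queue
#SBATCH -A myproject          # placeholder: your Stampede3 allocation

export OMP_NUM_THREADS=48     # one thread per core on a 48-core SKX node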

Thus we see that this batch script initiates a total of 2 MPI ranks, with one rank per node, and with each rank spawning 56 threads, one for each core on a CLX node. So why not just launch 112 MPI processes? Either way, the cores won't be oversubscribed. But it turns out there is far less overhead involved in launching 2 MPI processes and having them fork 56 threads each, as compared to launching 112 separate MPI processes. The difference in initialization time won't matter much on 2 nodes, but on hundreds or thousands of nodes, it can lead to a substantial delay.

Furthermore, the MPI-only approach might use too much memory, as mentioned previously. OpenMP opens up the possibility for the parallel workers to share memory within a node. This may create opportunities for using memory more efficiently, or perhaps eliminating a layer of message passing.

A point of caution with hybrid codes is that special considerations come into play when MPI is called from within an OpenMP parallel construct. (Our example code does not do this.) Details are covered in Hybrid Programming with OpenMP and MPI, where you will find a thorough treatment of the whole subject.
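
For example, a hybrid code that intends to make MPI calls inside OpenMP parallel regions would normally request an appropriate level of thread support at startup, rather than calling plain MPI_Init. The sketch below shows the general idea; MPI_THREAD_FUNNELED, which promises that only the main thread will make MPI calls, is chosen purely for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int required = MPI_THREAD_FUNNELED;  /* only the main thread will call MPI */
    int provided;

    MPI_Init_thread(&argc, &argv, required, &provided);
    if (provided < required) {
        printf("Warning: requested thread support level %d, but the MPI "
               "library provides only level %d\n", required, provided);
    }

    /* ... hybrid MPI + OpenMP work would go here ... */

    MPI_Finalize();
    return 0;
}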

 