Once you have inserted MPI calls, the program must be compiled and executed in a parallel environment. Details on compiling and running a parallel program with the Slurm batch system are found in the Cornell Virtual Workshop roadmaps for the Stampede2 Environment and the Frontera Environment. In this section, we present just a brief synopsis of what is involved.

Even though MPI was designed primarily for distributed-memory parallelism, it can also be used to parallelize codes on an architecture, like that of Stampede2 or Frontera, that features distributed pools of shared memory. In the simplest (and not necessarily scalable) approach, the application treats the system no differently than if all the memory were distributed: every process maintains its own separate memory and communicates through MPI, whether its partners are local or remote. On a typical Intel Xeon processor, this at least has the advantage of helping to localize each process's working set to the caches and memory controller associated with a single core. This approach is sufficient for the purposes of this tutorial.
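To make this concrete, here is a minimal sketch (not taken from the tutorial's own codes) of the pure-MPI style just described: each process owns a private value and exchanges it with a neighbor through an explicit message, regardless of whether that neighbor happens to reside on the same node. The ring exchange pattern and the values are arbitrary illustrations.

```c
/* pure_mpi.c -- illustrative sketch of the pure-MPI approach: every process
 * keeps its own data and shares it only through explicit messages, whether
 * the partner rank is local or remote. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process maintains its own private copy of the data... */
    double local = 100.0 * rank;
    double recvd = 0.0;

    /* ...and exchanges it with neighbors in a ring via explicit messages. */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Sendrecv(&local, 1, MPI_DOUBLE, right, 0,
                 &recvd, 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("Rank %d of %d received %g from rank %d\n", rank, size, recvd, left);
    MPI_Finalize();
    return 0;
}
```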

However, if your ultimate goal is large-scale parallelism on a cluster of many-core processors, such as Stampede2 or Frontera, there is more to consider as you move toward using MPI in production. It may be necessary to add a second layer of parallelism in order to allow your MPI processes to use the available hardware threads. To achieve this goal, you should consider using MPI in conjunction with OpenMP. For details on how to do this, see the Hybrid topic.
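As a rough illustration of what the Hybrid topic covers, the sketch below shows the basic shape of a hybrid code: MPI is initialized with a requested threading level, and each MPI process then opens an OpenMP parallel region to occupy its share of the hardware threads. The threading level and the trivial work in the parallel region are illustrative choices, not requirements of the tutorial.

```c
/* hybrid.c -- hedged sketch of a two-layer MPI + OpenMP code. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* Request a threading level; MPI_THREAD_FUNNELED means only the thread
     * that initialized MPI will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI process spawns a team of OpenMP threads to occupy the
     * hardware threads of the cores assigned to it. */
    #pragma omp parallel
    {
        printf("Rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Such a code would typically be built with an MPI compiler wrapper plus the compiler's OpenMP flag (for example, mpicc -qopenmp with the Intel compilers or mpicc -fopenmp with GCC) and launched with fewer MPI processes per node, so that each process's threads can fill the remaining cores and hardware threads.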

If your code relies heavily on linear algebra operations, there may be an easier route to multithreading than inserting OpenMP directives yourself. Intel's Math Kernel Library (MKL) is already compiled with OpenMP, so its multithreading can be enabled just by setting an environment variable. The same is true of code written in Python, R, and MATLAB. There are further instructions on this in the Stampede3 User Guide and the Frontera User Guide.
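As a sketch of this MKL route, the example below calls MKL's threaded cblas_dgemm; no OpenMP directives appear in the user code. The matrix size is arbitrary, and the thread count is set here with mkl_set_num_threads() only to show the hook; in a batch job you would more often set MKL_NUM_THREADS or OMP_NUM_THREADS in the environment instead.

```c
/* mkl_threads.c -- hedged sketch showing how an MKL call can run
 * multithreaded without any OpenMP directives in the user's own code. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    int n = 2000;                       /* arbitrary matrix dimension */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* The thread count can be set in the environment (MKL_NUM_THREADS or
     * OMP_NUM_THREADS) before launch, or from the code as done here. */
    mkl_set_num_threads(4);

    /* MKL's dgemm is itself compiled with OpenMP, so this single call
     * runs across the requested threads. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %g (using up to %d MKL threads)\n",
           C[0], mkl_get_max_threads());
    free(A); free(B); free(C);
    return 0;
}
```

The program would be linked against MKL when compiled (for example, via the Intel compilers' -mkl or -qmkl option, depending on compiler version).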

There are a few additional points specific to Stampede2 to keep in mind:

  • If you are using the KNL (Knights Landing) nodes, parallelizing your code with MPI alone is unlikely to be the best approach for utilizing all 272 hardware threads per processor. Experience has shown that assigning 1 or 2 MPI processes to each KNL tile (a tile contains 2 cores, or equivalently, 8 hardware threads) may result in fewer memory conflicts and better performance overall.
  • If you are using Skylake or Ice Lake nodes, then creating a hybrid MPI/OpenMP code is less crucial until you start using hundreds of nodes. Even so, if you assign an MPI process to every physical core, startup eventually becomes an issue for any pure-MPI code as the process count grows. Hybrid programming is likely to make your code more scalable, especially during startup.
  • All multi-node parallel applications necessarily rely on Stampede2's Omni-Path interconnect. Both the Intel MPI and MVAPICH2 implementations work with Omni-Path, but they work best with different settings of the PSM2_KASSIST_MODE environment variable. See the programming and performance section in the Stampede2 User Guide or the Frontera User Guide.
 