Even though MPI was designed primarily for distributed-memory parallelism, it is most often used in HPC environments that feature distributed pools of shared memory, i.e., clusters of multicore nodes. In the simplest (and not necessarily scalable) approach, a parallel application treats such an architecture no differently than if all the memory were distributed: every process maintains its own separate memory and communicates through MPI, whether its partners are local or remote. This is called the "pure-MPI" approach. On a typical set of HPC nodes, this may at least have the advantage of helping to localize the working set of each process to the caches and memory controller associated with a single core.
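
As a minimal illustration of the pure-MPI model, the sketch below (in C) has each process keep its own private copy of a variable and share it only through an explicit message; the same code runs unchanged whether the two ranks happen to share a node or sit on different nodes. The variable names here are placeholders chosen for this example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process owns a private copy of "token"; data moves only through
       explicit messages, whether the partner process is on the same node
       or a remote one. */
    if (size >= 2) {
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    printf("Rank %d of %d sees token = %d\n", rank, size, token);

    MPI_Finalize();
    return 0;
}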

The pure-MPI approach will be sufficient for the purposes of the present tutorial. But if your ultimate goal is large-scale parallelism on a cluster of many-core processors, such as Frontera or Vista, there is more to consider as you move toward using MPI in production. It may be necessary to add a second layer of parallelism in order to allow your MPI processes to use the available hardware threads, or to improve overall scalability. To achieve this goal, you should consider using MPI in conjunction with OpenMP. For details on how to do this, see the Hybrid Programming with OpenMP and MPI roadmap.
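
To give a rough sense of what that second layer looks like, here is a minimal hybrid sketch, assuming an MPI compiler wrapper with OpenMP support enabled (e.g., mpicc -fopenmp, or the corresponding Intel option). The main thread handles MPI, while an OpenMP parallel region spreads work across the hardware threads within each process.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, provided;

    /* Request a threading level in which only the main thread makes MPI
       calls; the OpenMP threads do purely local work. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Second layer of parallelism: threads within an MPI process */
        printf("Rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

The Hybrid Programming with OpenMP and MPI roadmap covers the important details this sketch omits.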

In certain cases, such as when your code relies heavily on linear algebra operations, multithreading can be achieved more easily than by manually inserting OpenMP directives. Intel's Math Kernel Library (MKL) is already compiled with OpenMP, so multithreading can be enabled simply by setting an environment variable. This works even for code written in Python, R, and MATLAB, provided that the underlying application libraries (e.g., NumPy) depend on MKL. For Frontera, you can find further instructions on this in the Frontera User Guide.
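
For example, the following sketch, which assumes a program linked against MKL's threaded layer, contains no OpenMP directives at all; the dgemm call is multithreaded internally by MKL, and the thread count can be set at run time through an environment variable such as MKL_NUM_THREADS or OMP_NUM_THREADS.

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>   /* MKL's C interface; link against a threaded MKL layer */

int main(void) {
    const int n = 2000;
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* No OpenMP directives appear in this code: the dgemm call below is
       multithreaded internally by MKL, with the thread count taken from
       the environment (e.g., MKL_NUM_THREADS). */
    printf("MKL will use up to %d threads\n", mkl_get_max_threads());

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);
    free(a); free(b); free(c);
    return 0;
}

Running this with MKL_NUM_THREADS set to 8 in the environment, for instance, allows the matrix multiplication to use up to 8 threads with no changes to the source.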

Here are a few additional points to keep in mind for HPC systems:

  • Startup eventually becomes an issue for any pure-MPI code that runs on multiple nodes with dozens of cores apiece, because launching one process per core quickly leads to very large process counts. Therefore, it may become crucial to create a hybrid OpenMP/MPI code when using hundreds of such nodes. Hybrid programming is likely to make your code more scalable, especially during startup.
  • Still, it is often desirable to assign 2 or more MPI processes to each node, each of which forks multiple threads; a sketch of one way to set this up follows this list. This may result in fewer memory conflicts and better performance overall.
  • Achieving top performance necessarily depends on the details of the architecture, even down to the processor level. See the programming and performance section of the Frontera User Guide or the Vista User Guide for best practices on those systems.
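
To illustrate the second point above, the sketch below uses MPI_Comm_split_type to discover how many MPI processes share a node and then divides the node's available hardware threads among them. This is only illustrative; in production runs, process placement and thread counts are normally controlled through the batch system and the MPI launcher rather than in the source code.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, node_rank, node_size, provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Group together the MPI processes that share this node's memory. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Divide the hardware threads visible to this process among the
       node-local MPI processes (a rough heuristic for illustration only). */
    int nthreads = omp_get_max_threads() / node_size;
    if (nthreads < 1) nthreads = 1;
    omp_set_num_threads(nthreads);

    printf("Global rank %d is local rank %d of %d on its node, using %d threads\n",
           rank, node_rank, node_size, nthreads);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}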

Of course, any MPI program must be compiled and executed in the context of a particular parallel environment. On the pages that follow, we present just a synopsis of what is involved. More specific guidance on how this is done on Frontera, for example, can be found in the Cornell Virtual Workshop roadmap Getting Started on Frontera, particularly the sections on environment modules, compiling, and running programs with the Slurm batch system. The user guides for Frontera and Vista are also relevant.

 