MPI has long been fundamental to cluster computing. It remains the most important and familiar way to scale out a computation to larger sizes, and that is no less true on the Xeon SP nodes of Frontera and Stampede3. This lab takes a quick look at how to compile and run an MPI code for CLX/SKX through a simple "Hello, World" example. For specificity, the instructions are presented for Frontera, but they are nearly identical for Stampede3, and they will work on any machine or cluster where MPI is installed and Xeon SP processors are available. If you are unfamiliar with MPI, you may want to consult our companion material, the roadmaps grouped under the MPI heading.

Begin by choosing a source code to download: mpi_hello.c or mpi_hello.f90, according to your preferred language. Download the code either through a link to the file or by copy-pasting it, just as you did in the earlier compile-and-run exercise with OpenMP.
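For reference, a minimal MPI "Hello, World" in C follows the pattern sketched below. This is only an illustration of the general structure; the downloadable mpi_hello.c may differ in its details.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */
    MPI_Get_processor_name(name, &namelen);  /* host running this rank */

    printf("Hello, World! I am rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();                          /* shut down MPI */
    return 0;
}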

MPI codes are typically compiled through a standard wrapper. The wrapper ensures that the compiler receives the correct names of the MPI include files and libraries, together with the right paths to whatever else is needed to compile and link the program. The only visible change compared to a non-MPI compilation is that the apparent name of the compiler is different; the options for the underlying compiler (e.g., Intel C or Fortran) look exactly the same. As described in Compiler Options for Xeon SP, we can use Intel's -x option to specify a target architecture; in this case we compile for the CORE-AVX512 instruction set, which is suitable for both CLX and SKX.

$ mpicc -xCORE-AVX512 -O3 mpi_hello.c -o mpi_hello
$ ### OR ###
$ mpif90 -xCORE-AVX512 -O3 mpi_hello.f90 -o mpi_hello

To confirm that the underlying compiler and MPI implementation are the ones you expect, try one or more of these commands:

$ module list
$ which mpicc
$ mpicc --version

You can submit the following script with Slurm's sbatch command to run the code in the development queue on Frontera. Alternatively, you can start an interactive session on two Frontera compute nodes with idev -N 2 -n 112, then run the MPI code at the command prompt with just the ibrun command. On clusters other than Frontera and Stampede3, mpirun or mpiexec is generally the equivalent of TACC's ibrun.
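A sketch of such a batch script is shown below. The job name and time limit are assumptions, and the account name is a placeholder; adjust the #SBATCH settings to suit your own allocation.

#!/bin/bash
#SBATCH -J mpi_hello        # job name (assumed)
#SBATCH -o slurm-%j.out     # output file; %j is replaced by the job ID
#SBATCH -p development      # Frontera's development queue
#SBATCH -N 2                # number of nodes
#SBATCH -n 112              # total MPI tasks (56 cores per CLX node x 2 nodes)
#SBATCH -t 00:05:00         # time limit, hh:mm:ss (assumed)
#SBATCH -A myproject        # allocation account -- placeholder, edit this!

ibrun ./mpi_hello           # TACC's launcher starts all 112 MPI tasks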

If you do submit the above as a batch script (you can name it run_mpi_hello.sh, perhaps), remember to edit the account so that it is one you are allowed to use. Output from your job will be saved to a file named slurm-######.out. The script starts 112 MPI processes in total, equal to the number of cores on the 2 CLX nodes that it requests, so expect a lot of "hellos" in the output!

The ibrun tool used to run the MPI job above is configured to support task affinity. If you run one MPI application at a time, the ibrun launcher will spread each node's tasks evenly across a CLX node's two sockets, with consecutive tasks occupying the same socket when possible. A similar process takes place when ibrun is used to launch MPI jobs on Stampede3. The related issue of thread affinity is discussed as part of the Exercise at the end of this topic.

You could run the same example on Stampede3 with a few minor modifications (the changed batch settings are sketched after the list):

  1. change the partition name from development to skx-dev
  2. reduce the number of processes from 112 to 96, to reflect the reduced number of cores on two Stampede3 SKX nodes
  3. change your allocation account number to enable you to run on Stampede3
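Concretely, if you are using a script laid out like the one above, the modified #SBATCH lines might look like the following; the account name is again just a placeholder.

#SBATCH -p skx-dev          # Stampede3's SKX development queue
#SBATCH -N 2                # still two nodes
#SBATCH -n 96               # total MPI tasks (48 cores per SKX node x 2 nodes)
#SBATCH -A myskxproject     # placeholder -- an allocation valid on Stampede3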

In practice, you may not want to start one MPI rank on every core, especially if each rank uses a lot of memory. Often it is more effective to have fewer MPI ranks per node, each of which is multithreaded with OpenMP. And that is the motivation for the next exercise.
