An MPI environment consists of multiple concurrently running tasks, with communication between these tasks mediated by MPI software and some network infrastructure (Omni-Path on Stampede2, InfiniBand on Frontera). Tasks initiate communication in this environment through the MPI API. This page describes how Slurm is (and is not) involved in establishing a job consisting of communicating MPI tasks.

Invoking MPI tasks

Some Slurm installations use an MPI plugin so that Slurm's own launcher, srun, can start tasks appropriately for an MPI implementation such as Open MPI, MVAPICH2, or Intel MPI. Other installations expect users to launch their MPI tasks with the launcher that comes with the particular MPI implementation. Usually this implementation-specific launcher is called mpiexec or mpirun (one is often a symbolic link to the other).
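As a rough illustration of the two styles, the commands below start a hypothetical executable (./my_mpi_program, a placeholder) either through Slurm's srun or through an implementation-specific launcher. This is only a sketch; the available plugin names and options depend on how the site has configured Slurm and MPI:

    # Style 1: Slurm launches the tasks itself through srun and an MPI plugin
    # (the PMIx plugin is assumed here; installations differ).
    srun --mpi=pmix -n 4 ./my_mpi_program

    # Style 2: the MPI implementation's own launcher starts the tasks.
    mpiexec -n 4 ./my_mpi_program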

On Stampede2 and Frontera, the preferred MPI launcher is TACC's own ibrun, which is an all-purpose front end to several different MPI implementations. The simplest invocation of the launcher has the form shown below, where the executable name is just a placeholder:
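    ibrun ./my_mpi_program

In the simplest case no task count is given on the command line; as described below, ibrun works this out from the environment variables that Slurm defines for the job.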

Let's take a look at what happens in the typical case where an MPI application is run via ibrun in a batch submission:

  1. User submits a batch script, which is put into the queue.
  2. Once the requested resources are allocated, Slurm executes the batch script on the first node of the allocation. The batch script has access to all of the usual environment variables that Slurm defines.
  3. The ibrun line in the script is encountered and executed. ibrun examines the Slurm environment variables to determine the node list in the allocation, as well as parameters such as the number of total tasks. It uses this information to launch the required number of tasks using the selected MPI implementation.
  4. Once all MPI processes have terminated, the batch script continues with its next command, or exits if there are none.
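The sequence above corresponds to a batch script along the following lines. This is only a minimal sketch: the job name, queue, node and task counts, time limit, and executable name are placeholders to be replaced with values appropriate to your own job and system.

    #!/bin/bash
    #SBATCH -J mpi_demo           # job name (placeholder)
    #SBATCH -o mpi_demo.o%j       # output file; %j expands to the job ID
    #SBATCH -p normal             # queue name (placeholder; see the system's user guide)
    #SBATCH -N 2                  # number of nodes
    #SBATCH -n 8                  # total number of MPI tasks
    #SBATCH -t 00:10:00           # wall-clock time limit (hh:mm:ss)

    # Slurm defines these variables before the script starts; ibrun reads them
    # (among others) to decide where and how many tasks to launch.
    echo "Nodes in allocation: $SLURM_JOB_NODELIST"
    echo "Total MPI tasks:     $SLURM_NTASKS"

    # Launch the MPI tasks across the allocation (executable name is a placeholder).
    ibrun ./my_mpi_program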

Recall that MPI tasks are separate processes. Typically, each process inherits the same environment that was present in the batch script that launched the tasks. However, an extra variable or two might be uniquely defined in the environment of each process. An example is MPI_PROCID, which is equivalent to the MPI rank of the current task.
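One way to see this in action is to have ibrun launch a small wrapper script rather than an MPI executable, so that every task reports its own environment. This is just a sketch: the rank variable is named MPI_PROCID here to match the text above, but the exact name set by the launcher can differ between MPI stacks (PMI_RANK and SLURM_PROCID are other common ones).

    #!/bin/bash
    # show_rank.sh -- hypothetical wrapper, launched with: ibrun ./show_rank.sh
    # Each task prints the node it is running on and its rank-identifying variable.
    echo "host=$(hostname)  MPI_PROCID=${MPI_PROCID:-unset}"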

Thus, in the absence of the special MPI plugin, Slurm is only indirectly involved in launching MPI processes. Slurm's responsibility in this case does not extend beyond defining the required environment variables and executing the batch script. It is the user's invocation of an MPI launcher (e.g., ibrun) that actually initiates the MPI processes on the nodes of the allocation.
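A quick way to see this division of labor in a batch script is to compare a bare command with the same command run through ibrun. This is a sketch; launching a non-MPI command such as hostname through ibrun is a common sanity check, though the details can vary with the installation.

    hostname           # runs once, on the node where Slurm executes the batch script
    ibrun hostname     # one copy per task, spread across the nodes of the allocation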

For more information on how to launch MPI applications from Slurm jobs at TACC, together with sample batch scripts, refer to the Stampede3 User Guide or the Frontera User Guide.

 