Slurm's srun may be used to distribute independent tasks among allocated resources, enabling parameter sweeps or embarrassingly parallel use cases for non-MPI code. This technique uses the Slurm parameters -N and -n in the same manner as a parallel MPI job to achieve the same kind of distribution of tasks, only without an MPI environment. Hence, srun is useful for concurrent serial or threaded (shared memory) applications.

Much like the scripted series of jobs described earlier, srun launches a series of identical tasks that differ only by their environment variables. Each task has access to variables that indicate the number of tasks as a whole, as well as the "rank" of a given task within the set of all tasks.

Usage

As we saw in the earlier discussion of srun's execution model, the key to running parallel tasks with srun is to use -N to set the number of nodes, together with either -n to set the total number of tasks or --ntasks-per-node to set the number of tasks on each node:
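A command of the following form would fit the description below. It is only a sketch: the ./my_program path is assumed, and any options a site may require, such as a queue name or time limit, are left out.

```shell
# Two nodes with two tasks on each node: four copies of my_program in total.
nodes=2
tasks_per_node=2
total_tasks=$(( nodes * tasks_per_node ))
echo "Requesting $nodes nodes and $total_tasks tasks in total"

# The guard only lets this sketch run harmlessly on a machine without Slurm.
# "-N 2 -n 4" would request the same four tasks by giving the total instead.
if command -v srun >/dev/null 2>&1; then
    srun -N "$nodes" --ntasks-per-node="$tasks_per_node" ./my_program
fi
```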

In this example, srun will allocate two nodes, and launch two instances of my_program on each node.

Each running instance of a program launched via srun will have a variable SLURM_PROCID defined in its environment that uniquely identifies which task it is. The values of SLURM_PROCID range from 0 to the number of running processes minus 1. Similarly, SLURM_NTASKS contains the total number of tasks (equivalent to the value specified by -n). By using these two variables, it is possible for tasks launched this way to divide up a workload, or implement parameter sweeps.
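One way a task might use these two variables to pick its share of a parameter sweep is sketched below. The script name, the sweep size of 10 items, and the items_for_task helper are all assumptions made for illustration. Only SLURM_PROCID and SLURM_NTASKS come from Slurm.

```shell
#!/bin/bash
# sweep_task.sh (hypothetical name): srun starts one copy of this script per task.

# Print the work-item indices owned by one task: rank, rank+ntasks, rank+2*ntasks, ...
items_for_task() {
    # $1 = task rank, $2 = number of tasks, $3 = number of work items
    i=$1
    while [ "$i" -lt "$3" ]; do
        echo "$i"
        i=$(( i + $2 ))
    done
}

# Slurm defines these for every task launched by srun. The defaults only
# allow a stand-alone dry run outside of Slurm.
rank=${SLURM_PROCID:-0}
ntasks=${SLURM_NTASKS:-1}
nitems=10   # assumed size of the parameter sweep

for item in $(items_for_task "$rank" "$ntasks" "$nitems"); do
    echo "task $rank of $ntasks: processing item $item"
done
```

With four tasks, task 1 would handle items 1, 5, and 9. This round-robin split needs no communication between tasks, because every task can compute its own share from its rank alone.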

Runtime Considerations

In some Slurm installations, the srun command can be executed from within a batch script or on the command line of a compute node. TACC, however, only allows execution of srun from the command line on a login node. This means there are a few additional runtime considerations to be aware of when using the command line method:

  • srun is a synchronous command: it blocks the shell on the login node until the job has run and all of its tasks are complete. In contrast, sbatch is asynchronous: it returns immediately, leaving the job to be queued and run on its own. For this reason, sbatch is the recommended way to get a Slurm allocation for long-running scripts.
  • srun by default combines the stdout and stderr from all the parallel tasks with its own stdout and stderr and sends the merged streams to your terminal. If you want to save these streams into one or two files, put redirects (or a pipe or a tee) at the end of the srun command line as you normally would. Alternatively, you can save the stdout and stderr from the parallel tasks (only) into two separate files using the -o and -e flags, just as you would with sbatch.
  • In general, the environment on the login node is inherited by whatever tasks srun executes, including the special case of an interactive session started with --pty. If the tasks need different environment variables (e.g., OMP_NUM_THREADS), you can set them temporarily just before invoking srun, or you can have srun execute a wrapper script that sets the required variables and then invokes the executable.
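The wrapper-script approach from the last item could look like the sketch below. The script name, the default of 4 threads, and the THREADS_PER_TASK variable are assumptions for illustration, not anything Slurm or TACC defines.

```shell
#!/bin/bash
# omp_wrapper.sh (hypothetical name): set per-task variables, then run the real program.
# THREADS_PER_TASK is an assumed helper variable, not something Slurm defines.
export OMP_NUM_THREADS="${THREADS_PER_TASK:-4}"

# Replace this shell with the command passed as arguments, e.g. ./my_program.
# With no arguments, exec does nothing and the script simply ends.
exec "$@"
```

It might be launched as srun -N 2 -n 4 -o tasks.out -e tasks.err ./omp_wrapper.sh ./my_program, which also sends the tasks' stdout and stderr to two separate files. For a one-off run, prefixing the assignment to the command (OMP_NUM_THREADS=4 srun -N 2 -n 4 ./my_program) sets the variable temporarily without a wrapper.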
 