Hybrid batch script for 2 tasks/node, 24 threads/task

Example script uses 5 nodes

  • Line 2: Specify the total number of MPI tasks to be started by the batch system
  • Line 3: Specify the total number of nodes, equal to tasks/2 (so 2 tasks/node)
  • Line 5: Set the number of threads for each process
  • Line 6: PAMPering at the process level; the easiest approach is to invoke a script to manage affinity
    • this is not strictly necessary, as many MPI launchers will PAMPer by default
    • TACC's task_affinity script pins tasks to sockets and ensures local memory allocation
    • if task_affinity isn't quite right for your application, use it as a starting point

Bash:

...
#SBATCH -n 10 
#SBATCH -N 5 
... 
export OMP_NUM_THREADS=24 
ibrun task_affinity ./a.out

Csh:

...
#SBATCH -n 10 
#SBATCH -N 5 
... 
setenv OMP_NUM_THREADS 24 
ibrun task_affinity ./a.out
What does task_affinity do?

It tries to make judicious use of the available NUMA nodes by having each process allocate memory on the most local NUMA node. If you want to see how task_affinity works, you can inspect it by running cat `which task_affinity` on a login node, in the script for a batch job, or in an interactive session (idev -p <queue_name>, with the appropriate queue specified).
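For example (the queue name is a placeholder, just as in the text above):

# On a login node, or within a batch job:
cat `which task_affinity`

# Or from an interactive session, substituting an appropriate queue name:
idev -p <queue_name>
cat `which task_affinity`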

It works similarly to the following script, which extracts the local MPI rank and size, sets the numactl options for each process, and so on.

  • The script is meant to be run by an MPI startup utility, e.g., TACC's ibrun (a usage sketch follows the two listings below)
  • It therefore needs to be made executable (e.g., with chmod 700 numa.sh)
  • Assumption: ibrun always assigns MPI ranks sequentially, filling slots on one node before moving on to the next
  • Again, specify total nodes equal to tasks/2 (so 2 tasks/node)
  • The script pertains to MVAPICH2 and therefore requires module load mvapich2 (Stampede3) or module load mvapich2-x (Frontera) in order to run
  • For Intel MPI, the I_MPI_PIN_ variables would have to be parsed instead

#!/bin/bash
# Turn off MVAPICH2's built-in affinity so that numactl controls placement
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
# Local rank (LR), local size (LS), and target socket (SK)
LR=$MV2_COMM_WORLD_LOCAL_RANK
LS=$MV2_COMM_WORLD_LOCAL_SIZE
SK=$(( 2 * $LR / $LS ))
# Bail out if the arithmetic failed (e.g., the MV2_* variables were not set)
[ -z "$SK" ] && echo "SK is null!"
[ -z "$SK" ] && exit 1
# Bind this rank's CPUs and memory to socket SK
numactl -N $SK -m $SK ./a.out

#!/bin/csh
# Turn off MVAPICH2's built-in affinity so that numactl controls placement
setenv MV2_USE_AFFINITY 0
setenv MV2_ENABLE_AFFINITY 0
# Local rank (LR), local size (LS), and target socket (SK)
set LR=$MV2_COMM_WORLD_LOCAL_RANK
set LS=$MV2_COMM_WORLD_LOCAL_SIZE
@ SK = ( 2 * $LR / $LS )
# Bail out if the arithmetic failed (e.g., the MV2_* variables were not set)
if ( ! ${%SK} ) echo "SK is null!"
if ( ! ${%SK} ) exit 1
# Bind this rank's CPUs and memory to socket SK
numactl -N $SK -m $SK ./a.out
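For completeness, the relevant portion of a Bash batch script that launches the wrapper might look like the following (a sketch; numa.sh is simply the illustrative name used in the bullets above):

...
#SBATCH -n 10
#SBATCH -N 5
...
module load mvapich2          # mvapich2-x on Frontera
export OMP_NUM_THREADS=24
chmod 700 numa.sh             # ensure the wrapper is executable
ibrun ./numa.sh               # each rank runs numactl ... ./a.out via the wrapper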

In principle, the above script also permits more than one MPI rank to occupy each socket. However, to prevent ranks from piling up on the same core, it would be necessary to add something like -C `expr 2 \* $LR + $SK` to the final command above. The task_affinity script handles this issue a bit better: when it assigns core affinity, it selects core numbers to ensure that the MPI ranks are not assigned to physically adjacent cores.
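Concretely, the last line of the Bash wrapper could be changed along these lines (a sketch based on the suggestion above; it is appropriate only when each rank runs a small number of threads, and the core numbering is machine-dependent, so adjust the formula for your node layout):

# Pin each rank to its own core (-C takes over the CPU restriction from -N),
# while still binding its memory to socket SK
numactl -C `expr 2 \* $LR + $SK` -m $SK ./a.out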

Again, none of this scripting is strictly necessary, because MVAPICH2 already includes features that automatically manage process affinity and memory pinning for hybrid applications (see, e.g., MV2_CPU_BINDING_POLICY and MV2_HYBRID_BINDING_POLICY in the MVAPICH2 User Guide). Thus, in the above example, we actually had to turn off the default behavior of MVAPICH2 in the first few lines.
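For instance, one might let MVAPICH2 do the binding itself with settings along these lines (a sketch; the variable values are illustrative, so check the MVAPICH2 User Guide for the options appropriate to your version and node architecture):

# Rely on MVAPICH2's built-in hybrid binding instead of a wrapper script;
# note that MV2_ENABLE_AFFINITY is NOT set to 0 here, so the library keeps control
export MV2_CPU_BINDING_POLICY=hybrid      # bind with knowledge of threads per rank
export MV2_HYBRID_BINDING_POLICY=spread   # spread each rank's threads across its cores
export OMP_NUM_THREADS=24
ibrun ./a.out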
