SMP Sockets
Hybrid batch script for 2 tasks/node, 24 threads/task
Example script uses 5 nodes
Line 2: Specify total MPI tasks to be started by batch
Line 3: Specify total nodes equal to tasks/2 (so 2 tasks/node)
Line 5: Set the number of threads for each process
Line 6: PAMPering at the process level; the easiest approach is to invoke a script that manages affinity
- Consider this not strictly necessary, as many MPI launchers will PAMPer by default
- TACC's task_affinity script pins tasks to sockets and ensures local memory allocation
- If task_affinity isn't quite right for your application, use it as a starting point
Bash version:
...
#SBATCH -n 10
#SBATCH -N 5
...
export OMP_NUM_THREADS=24
ibrun task_affinity ./a.out
C shell (csh) version:
...
#SBATCH -n 10
#SBATCH -N 5
...
setenv OMP_NUM_THREADS 24
ibrun task_affinity ./a.out
What does task_affinity do?
It tries to make judicious use of the available NUMA nodes by having processes allocate memory on the most local NUMA node. If you want to see how task_affinity works, you can inspect it by running cat `which task_affinity` on a login node, or perhaps in the script for a batch job, or in an interactive session (idev -p <queue_name> with the appropriate queue specified).
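For example, the following two commands, run on a login node or inside an idev session, show where the script lives and print its contents:
which task_affinity           # show where the script is installed
cat `which task_affinity`     # print its contents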
It works similarly to the following script, which extracts local MPI rank and size, sets the numactl options per process, etc.
- The script is meant to be run by an MPI startup utility, e.g., TACC's ibrun (a minimal launch sketch follows the script listings below)
- It therefore needs to be made executable (e.g., with chmod 700 numa.sh)
- Assumption: ibrun always assigns MPI ranks sequentially, filling slots on one node before moving on to the next
- Again, specify total nodes equal to tasks/2 (so 2 tasks/node)
- The script pertains to MVAPICH2 and therefore requires module load mvapich2 (Stampede3) or module load mvapich2-x (Frontera) to be run
- For Intel MPI, the I_MPI_PIN_ variables would have to be parsed instead
#!/bin/bash
# Turn off MVAPICH2's built-in affinity so that numactl controls placement
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
# LocalRank, LocalSize, Socket
LR=$MV2_COMM_WORLD_LOCAL_RANK
LS=$MV2_COMM_WORLD_LOCAL_SIZE
# Socket 0 for the first half of the local ranks, socket 1 for the second half
SK=$(( 2 * $LR / $LS ))
[ -z "$SK" ] && echo "SK null!"
[ -z "$SK" ] && exit 1
# Bind this rank's execution (-N) and memory allocation (-m) to its socket
numactl -N $SK -m $SK ./a.out
#!/bin/csh
# Turn off MVAPICH2's built-in affinity so that numactl controls placement
setenv MV2_USE_AFFINITY 0
setenv MV2_ENABLE_AFFINITY 0
# LocalRank, LocalSize, Socket
set LR=$MV2_COMM_WORLD_LOCAL_RANK
set LS=$MV2_COMM_WORLD_LOCAL_SIZE
# Socket 0 for the first half of the local ranks, socket 1 for the second half
@ SK = ( 2 * $LR / $LS )
if ( ! ${%SK} ) echo "SK null!"
if ( ! ${%SK} ) exit 1
# Bind this rank's execution (-N) and memory allocation (-m) to its socket
numactl -N $SK -m $SK ./a.out
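As a rough sketch of how the wrapper would be launched, assuming it is saved as numa.sh (the name used in the chmod example above) and with resource requests that simply mirror the earlier batch script:
...
#SBATCH -n 10
#SBATCH -N 5
...
module load mvapich2      # mvapich2-x on Frontera
export OMP_NUM_THREADS=24
chmod 700 numa.sh         # make the wrapper executable
ibrun ./numa.sh           # each MPI rank runs the wrapper, which starts ./a.out under numactl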
In principle, the above script also permits more than one MPI rank to occupy each socket. However, to prevent ranks from piling up on the same core, it would be necessary to add something like -C `expr 2 \* $LR + $SK` to the final command above. The task_affinity script handles this issue a bit better: when it assigns core affinity, it selects core numbers to ensure that MPI ranks are not assigned to physically adjacent cores.
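In the bash version, the final line would then look something like the sketch below; whether this particular core numbering is appropriate depends on how cores are mapped to sockets on the node, so treat it only as an illustration:
# Pin each rank to its own core (-C) in addition to its socket (-N) and memory (-m)
numactl -N $SK -m $SK -C `expr 2 \* $LR + $SK` ./a.out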
Again, none of this scripting is strictly necessary, because MVAPICH2 already includes features that automatically manage process affinity and memory pinning for hybrid applications (see, e.g., MV2_CPU_BINDING_POLICY and MV2_HYBRID_BINDING_POLICY in the MVAPICH2 User Guide). Thus, in the above example, we actually had to turn off the default behavior of MVAPICH2 in the first few lines.
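For example, a hybrid job that relies on MVAPICH2's own support might set something like the following in the batch script instead of using a wrapper; the two policy values shown here are assumptions chosen for illustration, so check the MVAPICH2 User Guide for the options that fit your MVAPICH2 version and node layout:
export MV2_CPU_BINDING_POLICY=hybrid      # assumed setting; see the MVAPICH2 User Guide
export MV2_HYBRID_BINDING_POLICY=spread   # assumed setting; see the MVAPICH2 User Guide
export OMP_NUM_THREADS=24
ibrun ./a.out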