Subdividing Communicators
The goal of subdividing a communicator can also be accomplished using process groups.
- MPI_Comm_group: extract the process group associated with the input communicator
- MPI_Group_incl: make a new group from selected members of the existing group (e.g., members in the same row of our 2D layout)
- MPI_Comm_create: form a communicator based on the input group
Ideas leading to the sample code below:
- Imagine processes are numbered left-to-right, then down
- Get the base group corresponding to MPI_COMM_WORLD
- Find my global rank, then find my specific row in the 2D layout
- Loop: make a row_list containing all ranks in the first row, then the second row, and so on
- Create a group from row_list and a communicator for the group
- Retain the communicator only for the row I'm in
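/* Construct a communicator shared by processes in the same row */
MPI_Group base_grp, grp;
MPI_Comm row_comm, temp_comm;
int row_list[NCOL], irow, myrank_in_world, i, j;
MPI_Comm_group(MPI_COMM_WORLD, &base_grp);  /* get the base group */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank_in_world);
irow = myrank_in_world / NCOL;  /* my row in the 2D layout */
for (i = 0; i < NCOL; i++) row_list[i] = i;  /* ranks in the first row */
for (i = 0; i < NROW; i++) {
    /* every process takes part in creating every row's communicator */
    MPI_Group_incl(base_grp, NCOL, row_list, &grp);
    MPI_Comm_create(MPI_COMM_WORLD, grp, &temp_comm);
    if (irow == i) row_comm = temp_comm;  /* keep only my own row's communicator */
    MPI_Group_free(&grp);  /* the group is no longer needed */
    for (j = 0; j < NCOL; j++) row_list[j] += NCOL;  /* advance to the next row */
}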
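The rank arithmetic behind the row assignment can be checked in isolation, without MPI. The following is a minimal sketch, assuming the left-to-right, then down numbering described above; the helper names row_of and fill_row_list are hypothetical and are not part of MPI or of the sample code.

```c
#include <assert.h>

/* Row index of a rank when ranks are numbered left-to-right, then down.
   (hypothetical helper, for illustration only) */
int row_of(int rank, int ncol)
{
    assert(rank >= 0 && ncol > 0);
    return rank / ncol;
}

/* Fill row_list[0..ncol-1] with the world ranks that make up the given row.
   (hypothetical helper, for illustration only) */
void fill_row_list(int row, int ncol, int *row_list)
{
    assert(row >= 0 && ncol > 0);
    for (int j = 0; j < ncol; j++)
        row_list[j] = row * ncol + j;
}
```

For example, with 4 columns, rank 7 falls in row 1, and row 2 consists of ranks 8 through 11. Computing the list directly from the row index gives the same result as adding NCOL to each entry on every pass through the loop.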
You might wonder why every process takes part in creating all NROW groups, even though any given process belongs to only one of them. The reason is that MPI_Comm_create is collective over the input communicator (in our case, MPI_COMM_WORLD): the MPI specification states that the call must be executed by all processes in that communicator, and that all processes must pass the same value for the group argument (grp), even if they do not belong to the new group. This requirement scales poorly when the number of processes is very large, as on petascale systems, so MPI_COMM_CREATE_GROUP was introduced in MPI-3 to alleviate the problem. Only the processes in the "group-to-be" (a subgroup of the parent communicator) need to call MPI_COMM_CREATE_GROUP. The operation is sometimes referred to as non-collective because it does not involve every process in the parent communicator; it is still collective over the members of the new group. This ability is essential for modular communication within a very large parent communicator, like those found on petascale or exascale installations.
int MPI_Comm_create_group(MPI_Comm comm, MPI_Group group,
                          int tag, MPI_Comm *newcomm);
The tag argument distinguishes between multiple calls to MPI_Comm_create_group; it does not interfere with tags used in calls to other functions, in particular point-to-point functions.
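As an illustration, the row communicator could be built with MPI_Comm_create_group so that each process constructs only the group for its own row. This is a sketch under stated assumptions, not a definitive implementation: the 2 x 3 layout (NROW, NCOL), the use of the row index as the tag, and the variable names are all choices made for this example, and the program expects to be launched with exactly NROW*NCOL processes.

```c
#include <mpi.h>
#include <stdio.h>

#define NROW 2   /* assumed layout for this example */
#define NCOL 3

int main(int argc, char **argv)
{
    MPI_Group base_grp, row_grp;
    MPI_Comm row_comm;
    int row_list[NCOL], world_rank, world_size, row_rank, irow, j;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    if (world_size != NROW * NCOL) {
        if (world_rank == 0)
            fprintf(stderr, "run with %d processes\n", NROW * NCOL);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Ranks are numbered left-to-right, then down */
    irow = world_rank / NCOL;
    for (j = 0; j < NCOL; j++)
        row_list[j] = irow * NCOL + j;   /* world ranks in my row only */

    MPI_Comm_group(MPI_COMM_WORLD, &base_grp);
    MPI_Group_incl(base_grp, NCOL, row_list, &row_grp);

    /* Only the members of row_grp take part in this call.
       The row index serves as the tag (an assumption of this sketch). */
    MPI_Comm_create_group(MPI_COMM_WORLD, row_grp, irow, &row_comm);

    MPI_Comm_rank(row_comm, &row_rank);
    printf("world rank %d: row %d, rank %d within row\n",
           world_rank, irow, row_rank);

    /* Release the groups and the communicator */
    MPI_Group_free(&row_grp);
    MPI_Group_free(&base_grp);
    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
```

Compared with the loop over all NROW rows, each process here makes a single communicator-creation call, and processes in different rows never have to synchronize with one another.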