This exercise tests the MPI-IO parallel write and read speeds to a Lustre file system, then explores how these speeds can be affected by the striping pattern. The test programs call MPI-IO routines for doing parallel I/O. For each file linked below, either use the link to download the file directly, or copy the code text and paste it into a file that has the specified name.

C lab files:

To compile the codes, run make at the command line.

MKRANDPFILE

The -l option specifies how many blocks of integers each process will write. By default, a block is 4 MB, so 20 blocks amount to 80 MB per process. If 16 processes each write 80 MB, the total file size for the test will be 1280 MB. The command for executing the code at TACC is:
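One hedged possibility, assuming TACC's ibrun launcher and a hypothetical -f option naming the output file (the flag and the path under $SCRATCH are assumptions; check the program's usage message for the exact options):

    # 16 MPI tasks, each writing 20 blocks of 4 MB (output path is hypothetical)
    ibrun -n 16 ./mkrandpfile -f $SCRATCH/pfile_test -l 20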

If you want to try running the above command, be aware that you will likely have to submit an interactive batch job to a compute or development node of your system (using TACC's idev, e.g.), as HPC centers generally discourage their users from running MPI executables on a login node.
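For instance, a hedged idev request for a single node with 16 tasks (the queue name and time limit are assumptions; adjust them for your site):

    # 1 node, 16 MPI tasks, 60 minutes in a development queue
    idev -N 1 -n 16 -p development -m 60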

XRANDPFILE

This program simply reads the file you wrote with mkrandpfile.
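A hedged sketch of the read test, reusing the assumed -f flag and path from the write sketch above:

    # read the file back with the same 16 MPI tasks
    ibrun -n 16 ./xrandpfile -f $SCRATCH/pfile_test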

Compare the read speed to the write speed you obtained previously. Remember to clean up the output file:
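Assuming the hypothetical path used in the sketches above:

    # delete the 1280 MB test file
    rm $SCRATCH/pfile_test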

The lab exercise consists of submitting a batch job using the supplied script and examining the results. You may need to edit the script to supply a valid account number. You might also need to modify the Slurm partition for the batch job and add a line defining $SCRATCH as a writable directory within your Lustre file system (the latter is unnecessary at TACC).
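For orientation, a rough sketch of what such a script might look like follows; every name and value in it (job name, account, partition, task count, stripe settings, paths, and the -f flag) is an assumption, so defer to whatever the supplied script actually contains:

    #!/bin/bash
    #SBATCH -J mpiio              # job name (output then appears as mpiio.o<jobid>)
    #SBATCH -o mpiio.o%j          # output file, matching the "less mpiio.o*" step below
    #SBATCH -N 1                  # one node
    #SBATCH -n 16                 # 16 MPI tasks
    #SBATCH -p normal             # partition; edit for your system
    #SBATCH -t 00:30:00           # wall-clock limit
    #SBATCH -A myaccount          # valid account number; edit

    # Outside TACC: point SCRATCH at a writable directory on your Lustre file system
    # export SCRATCH=/path/to/lustre/scratch

    # Write into a fresh directory with a chosen stripe pattern,
    # here 4 OSTs with a 1 MB stripe size (arbitrary example values)
    mkdir -p $SCRATCH/stripe_test
    lfs setstripe -c 4 -S 1m $SCRATCH/stripe_test

    ibrun ./mkrandpfile -f $SCRATCH/stripe_test/pfile -l 20
    ibrun ./xrandpfile -f $SCRATCH/stripe_test/pfile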

Notice the inherent difference between Lustre write and read speeds. Then see which stripe pattern led to the best rate for writes and reads, respectively. (Type less mpiio.o* and search for "Rates" in the output.) You may want to tweak the parameters to lfs setstripe and resubmit the job to see if you can do better!
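For example (the stripe count and size below are arbitrary values, and the directory is the hypothetical one from the script sketch above):

    # new files created in this directory will be striped across 8 OSTs in 4 MB chunks
    lfs setstripe -c 8 -S 4m $SCRATCH/stripe_test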

Bonus Exercise

Try running mkrandpfile and xrandpfile with more MPI processes to take advantage of all the cores on the node. For example, to do this interactively using 32 MPI tasks (and note, the output file will be larger):
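A hedged sketch, reusing the assumed flags and paths from earlier; 32 tasks times 80 MB per task now produces a 2560 MB file:

    # request an interactive session with 32 MPI tasks (queue and time are assumptions)
    idev -N 1 -n 32 -p development -m 60
    # then, inside the idev session:
    ibrun -n 32 ./mkrandpfile -f $SCRATCH/pfile_test -l 20
    ibrun -n 32 ./xrandpfile -f $SCRATCH/pfile_test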

This should run faster on a node equipped with an Intel Xeon Scalable Processor, such as a Stampede2 or Frontera node, which supports more workers and a higher degree of vectorization than nodes with older Xeon processors. Test the performance effects of these characteristics more thoroughly by trying the following:

  1. Increase the number of MPI tasks from 16 to 32 in the job script. (Does it help to add even more? What about 48 tasks on Stampede2, or 56 on Frontera?)
  2. Assuming your mpicc is a front end to Intel's icc, add -xCORE-AVX512 -qopt-zmm-usage=high to the compiler options in the Makefile, then do make clean; make, as sketched after this list. (If your mpicc is a front end to gcc, you should use -march=skylake-avx512 -mprefer-vector-width=512 instead.)
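A command-line sketch of step 2, assuming the Makefile compiles with a CFLAGS variable that can be overridden from the command line (the variable name is an assumption; editing the Makefile directly works just as well):

    # rebuild with the Intel AVX-512 options (hypothetical CFLAGS override)
    make clean
    make CFLAGS="-xCORE-AVX512 -qopt-zmm-usage=high"
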
Credit

The MPI-IO codes are from a former course at Indiana University. The original source code is archived here: mkrandpfile, xrandpfile.

Note that to access large files on a 32-bit system, two additional macros would have to be defined on the mpicc command line: -D_FILE_OFFSET_BITS=64 and -D_LARGEFILE64_SOURCE. However, these macros are transitional and are unnecessary on modern 64-bit systems (as explained at unix.org).
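For reference, a hypothetical compile line showing where such macros would go (the source and program names are placeholders):

    # needed only when building for a 32-bit system; redundant on 64-bit systems
    mpicc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -o mkrandpfile mkrandpfile.c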

 