Exercise
Demonstration of Amdahl's Law
In this exercise, we will see how the presence of serial (i.e., unparallelizable) work affects the scalability of a simple MPI code. What we'd like to find out is: does Amdahl's Law really hold true in practice?
Our starting point is a time-honored C code that calculates \(\pi\) to fairly high precision by performing a simple numerical integration. The integrand is \(1/(1 + x^2)\), chosen because \(d(\arctan(x)) = dx/(1 + x^2)\); conveniently, the definite integral of this function over the interval \(x=0\) to \(x=1\) equals \(\arctan(1)-\arctan(0) = \pi/4\). The algorithm we adopt is nothing more than the rectangle method, using the midpoint approximation; it's about the simplest one that can be imagined. To parallelize the job of summing the areas of all the tall, narrow rectangles, we have each worker skip through the collection of rectangles with a stride equal to numprocs. Thus, the partial sum computed by each worker involves 1/numprocs of the total number of rectangles. These partial sums are distinct because each worker starts at an offset equal to its rank.
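As a rough sketch of that striding pattern, worker rank handles rectangles rank, rank+numprocs, rank+2*numprocs, and so on. The function below is only illustrative; its name and details are assumptions, not necessarily what the exercise code uses:

/* Illustrative sketch, not the exercise's actual source: each worker's  */
/* midpoint-rule partial sum over rectangles rank, rank+numprocs, ...    */
double partial_pi(int rank, int numprocs, long n)
{
    double h   = 1.0 / (double) n;             /* width of each rectangle     */
    double sum = 0.0;
    for (long i = rank; i < n; i += numprocs) {
        double x = h * ((double) i + 0.5);     /* midpoint of rectangle i     */
        sum += 4.0 / (1.0 + x * x);            /* 4/(1+x^2); total sums to pi */
    }
    return h * sum;                            /* this worker's partial sum   */
}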
The algorithm is not quite perfectly parallelized, though, for two reasons. First, the partial sums must be collected onto one of the workers (rank 0) to obtain the full sum. Second, we need to see the final output, so this one worker must write it to the terminal. These steps cannot be made to go faster by dividing up the work among more workers; in fact, we imagine that the communication step will go slower with more workers.
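In MPI terms, those two serial steps are an MPI_Reduce of the partial sums onto rank 0, followed by a single print from rank 0. Here is a minimal, self-contained sketch, assuming a helper like the partial_pi above; it is not the exercise's actual source:

#include <stdio.h>
#include <mpi.h>

double partial_pi(int rank, int numprocs, long n);  /* hypothetical helper from the sketch above */

int main(int argc, char *argv[])
{
    int rank, numprocs;
    long n = 1000000;                    /* number of rectangles (example value only) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    double mypi = partial_pi(rank, numprocs, n);   /* the fully parallel part */
    double pi   = 0.0;

    /* Serial step 1: collect all partial sums into a single total on rank 0 */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Serial step 2: only rank 0 writes the answer to the terminal */
    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}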
Accordingly, we would like to investigate the performance impact of the call to MPI_Reduce (communication time) as well as the impact of the I/O performed by rank 0 (serial time). The program that enables us to experiment with this is pi_scale.c. The algorithm is exactly the same as before, but calls to MPI_Wtime() provide timings of the different code sections. Also, rather than use all the workers that are available, the code does the test repeatedly as the number of workers increases from 1 to numprocs. Each time, the job is divided up among more workers; when all the ways of dividing the job have been tried, the code outputs a summary, which includes the observed speedup as a function of the number of workers. It also prints the expected speedup based on Amdahl's Law.
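For reference, the standard form of Amdahl's Law that the summary presumably applies is: if \(f\) is the fraction of the total single-worker run time spent on serial work, then the predicted speedup on \(p\) workers is

\[ S(p) = \frac{1}{f + (1-f)/p}, \]

which approaches the ceiling \(1/f\) as \(p\) grows. In this exercise, the serial time is taken to be the measured communication time plus the I/O time, a choice that one of the questions below asks you to examine.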
Instructions
Here is what you need to do to try the code for yourself:
- Follow the link to pi_scale.c. In a command window on Frontera, run:
  $ cat > pi_scale.c
  Copy the text of the code from your browser over to the command window, then exit the cat command with ctrl-D. Now you have your C source file.
- Compile the C source code:
  $ mpicc pi_scale.c -o pi_scale
- Start a one-node interactive batch job so you can play with the code:
  $ srun -p development -A YOURACCT -t 0:30:00 -N 1 -n 56 --pty /bin/bash -l
- When your job starts, run the executable as you would any MPI code:
  $ ibrun pi_scale
- Input different numbers of intervals for the integration, n, and see how it affects the total run time and parallel efficiency. Remember, increasing n increases the total amount of work, but all the additional work is completely parallelizable.
- Don't forget to end your batch job by exiting the interactive shell (exit or ctrl-D) when you're done.
Here are some questions for you to think about, as you look over the results of your tests:
- Does Amdahl's Law provide a good estimate of the parallel performance of this code?
- As n increases from 10,000 to 1 billion, does Amdahl's Law predict a stronger or weaker deviation from the ideal speedup? Why?
- At what point does the near-constant cost of communication and I/O begin to outweigh the benefit of adding more workers? How does the crossover point change as n increases?
- In computing the fraction of serial work for the Amdahl's Law speedup estimate, why does the code simply add the communication time to the I/O time, as though they both count as "serial work" in the same way? Is this justified by the timing results?