This exercise takes you through some of the parallel speedups that are available on the Intel Xeon Scalable Processor platform. While the discussion here is specifically presented for working with the CLX processors on Frontera, a very similar workflow with some minor modifications will enable this code to work on the SKX processors on Stampede3. As with previous exercises, these modifications typically include: changing core count to reflect the smaller number of cores on SKX nodes (48 vs. 56); changing the name of the partition/queue to run on (skx-dev vs. development); and changing allocation account numbers to be able to run on Stampede3. Note that the performance results for the exercises will look considerably different on Stampede3 due to its having different compilers and hardware characteristics.

In this exercise, we will address two of the three dimensions of scaling: "scaling deep" through the use of vectorization, and "scaling up" through the use of multiple cores. Because the example code is small and simple enough for clarity's sake, it does not make sense to "scale out" to multiple nodes using MPI, but that strategy is of course valuable for many different problems.

1. Compile the hybrid code

The code for this lab, ompipi.c, has two types of parallel constructs built into it: MPI function calls for message passing, and OpenMP directives for multithreading. OpenMP directives are surrounded by #ifdef's (preprocessor directives) so that OpenMP can be omitted from the compiled code, if desired. As with earlier exercises, you will either want to download the file and transfer it to the target machine, or copy-paste the source code directly to that machine.

Take a minute to look over the source code. The real action takes place about halfway down. The inner loop over i computes an approximation to the integral of a function f defined at the end of the file. It is chosen so that

\[ f(x)\,dx = 4\,d(\arctan(x)) \]

and therefore, the integral from 0 to 1 should be equal to \( \pi \). Everything else in the code merely sets up MPI, OpenMP, and the timing loops. (The function MPI_Wtime is a "wallclock timer" that is part of the MPI library.)

When totalTasks is greater than 1, the loop over i becomes a strided loop. In that case, every MPI task computes a part of the integral, and MPI_Reduce combines the partial sums into a single sum. Alternatively (or additionally), the partial sums can be distributed among parallel OpenMP threads. Moreover, the threads can parallelize their summations internally through vector instructions.

This MPI code can be conveniently compiled using the compiler wrapper mpicc. Do so now, accepting the defaults for the shell environment and for the constants within the code. Initially we'll want to ensure that the result is optimized for CLX, but let's neglect OpenMP multithreading until we're ready for it:


mpicc --version   # verify that you are using the Intel compiler
mpicc -O3 -xCORE-AVX512 -qopt-zmm-usage=high ompipi.c -o mpipi

The initial "o" is left off the name of the executable (mpipi rather than ompipi) to denote the absence of OpenMP.

2. Run with and without vectorization (VPUs)

We'd like to run our tests interactively. It will be easiest if we do this on a single CLX compute node. On Frontera at TACC, the command to start an interactive session on 1 compute node, on which we can run up to 56 total tasks, is:


idev -n 56 -N 1 -p development

You might need to augment the command above with additional information about a specific account if you have multiple allocations. If the development queue is full, don't worry—any available CLX queue will do. Let's start just by timing one process without multithreading. Note that on most clusters other than TACC's, mpirun or mpiexec is generally used in place of ibrun.


ibrun -n 1 ./mpipi         # ibrun is TACC's MPI launcher (single node or multiple)

The output should confirm 1 process and 1 thread, after which it will report the best time from 10 trials of calculating \( \pi \). Make a mental note of the best time, because now we want to slow down the code by disabling vectorization:


mpicc -O1 -no-vec ompipi.c -o mpipi_novec
ibrun -n 1 ./mpipi_novec

You should observe that the answer doesn't change (except perhaps in the last couple of decimal places)—but that the compiled code now runs almost 4 times slower on Frontera, because we have turned off vectorization. We might have expected the vectorized code to run 8 times faster, since AVX-512 instructions operate on 8 doubles at a time. Or we might even have hoped for it to run 16 times faster, given that there are actually 2 VPUs per CLX core. For this particular code, however, we do not see vectorization speedups to that extent. Among other things:

  • We are doing integer-to-float conversions each time through the loop (to get the midpoint of the interval at which to evaluate the function f), and those conversions impede vectorization in the loop.
  • The loop is so short that loop overhead contributes to a lack of vectorization performance.

Nonetheless, we are able to get some speedup just by engaging the VPUs and the AVX-512 instructions. On Stampede3, the speedup obtained by the icx compiler is a bit more modest—only around 2x. For a much more detailed treatment of the various factors that can inhibit compilers from vectorizing loops, see our material on Vectorization.

3. Run 56 MPI tasks on 56 cores

Let's see how much speedup we get by using all of the CLX cores via MPI. Having observed a speedup of approximately 4x due to vectorization, let's now run the vectorized code on multiple cores:


ibrun -n 56 ./mpipi

The answer from this code looks correct, but the run time is so short that we have to wonder if we are starting to reach the limits of the timer's resolution. The easiest way to address this is to give the code more work to do. Let's increase the number of subintervals in the integration, first to 10 million, then to 100 million (i.e., \(10^7\), followed by \(10^8\)):


ibrun -n 56 ./mpipi 10000000
ibrun -n 56 ./mpipi 100000000

Notice that we are now exceeding the limits of accuracy for double-precision arithmetic: you may actually see the error increase for \(10^8\) subintervals. We'll want to stick with \(10^8\) anyway, as it will allow us to see the full speed boost from the following sequence:


ibrun -n 1  ./mpipi 100000000
ibrun -n 56 ./mpipi 100000000

Compute the amount of speedup resulting from running on all 56 cores of the CLX node, i.e., the ratio of the best times in the two cases. It should be around 55.1, or in any case some value very close to 56. This makes sense, as there is not much communication that might impede the scaling of this code. Likewise, on a single node of Stampede3, the speedup should be very close to 48 when 48 tasks are specified.1

4. Utilize 56 OpenMP threads

As noted, the exercise code is also instrumented to be able to run on multiple threads through OpenMP. The number of threads can be controlled through the OMP_NUM_THREADS environment variable. First, we need to recompile the code to access the OpenMP code and directives, and then we can run the code on one thread.


mpicc -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qopenmp ompipi.c -o ompipi
echo $OMP_NUM_THREADS           # default value at TACC should be 1
export OMP_NUM_THREADS=1        # do this if the above value != 1
ibrun -n 56 ./ompipi 100000000

There is little change in the best time, but the first time is a bit longer, presumably because it takes some time to start OpenMP's thread management system.

First, let's re-run the MPI-based version of the code (as done above), with 1 and 56 tasks. Then, instead of running 56 MPI processes, with each process running 1 thread, we can flip this and launch just one process running 56 OpenMP threads.


ibrun -n 1  ./ompipi 100000000
ibrun -n 56 ./ompipi 100000000
export OMP_NUM_THREADS=56
ibrun -n 1  ./ompipi 100000000

Compare the best times for each of the three runs. You should observe ~55x speedup for the 56-task MPI-based run, as compared to the MPI-only run with the single task, just as we saw previously (ibrun -n 56 ./mpipi 100000000). But if you look at the 56-thread OpenMP-based run, you should see that we're getting only about half of that speedup (approximately 27x). Why is that?

The answer has to do with affinity (or more precisely, thread affinity), which describes the manner in which threads are "pinned" or "bound" to cores. The OS is free to migrate threads to different cores, but sometimes that can have adverse effects. Without binding threads to cores, it appears that either some threads are being moved onto the same core, or they need to reload their caches from memory as they move around, leading to a less-than-expected speedup.

Fortunately, there are ways to bind threads to cores, using appropriate environment variables. Within OpenMP, the environment variable OMP_PROC_BIND controls how threads are bound to OpenMP "places", and the variable OMP_PLACES relates those places to the available hardware. In our case, we want to bind threads to cores, and we want to do so in a manner that spreads threads out evenly among the available cores. By setting OMP_PROC_BIND=spread and OMP_PLACES=cores, as in the code below, we can achieve the desired binding.


export OMP_PROC_BIND=spread
export OMP_PLACES=cores
export OMP_NUM_THREADS=56
ibrun -n 1  ./ompipi 100000000

Now, with threads spread out and bound to different cores, you should observe that the 56-thread OpenMP-based run achieves ~54x speedup, very close to what we observed for the MPI code. The choice of the "spread" pattern is not too important in this case, since all the cores are occupied and the threads are doing independent calculations. Thus, it should work equally well simply to set OMP_PROC_BIND=true, and this setting may even be optimal when there are fewer threads than cores, due to the way TACC assigns core IDs within dual-socket nodes. The subject of affinity can be very complicated, and we have only scratched the surface. For some additional discussion of how to configure affinities in OpenMP, have a look at the relevant section of Victor Eijkhout's Parallel Programming for Science and Engineering.

While the environment variables described above are defined within OpenMP, the Intel compilers also provide mechanisms for controlling the binding of threads to hardware resources, using the environment variable KMP_AFFINITY. The approximate analog to the OMP-based binding described above is with the following setting: export KMP_AFFINITY=granularity=core,balanced. Setting KMP_AFFINITY=scatter has nearly the same effect. There is some additional discussion of how to use KMP_AFFINITY in our companion material on hybrid programming.

Even with our simple example code, we have been able to achieve speedups both through vectorization ("scaling deep") and through the use of multiple cores ("scaling up"). Along the way, we have also encountered several real-world issues that affect actual performance, as opposed to one's naive expectations of speedups based on processor specifications. These include incomplete vectorization and the need to bind threads to specific cores.

Our code example is not typical in that it makes hardly any references to memory. It therefore tells us nothing about the memory side of Xeon Scalable Processor performance. Given an example that included more memory operations, we might see some of the potential benefits and pitfalls of multithreading. Unlike MPI processes, threads share memory, and while this can give multithreaded codes the advantage of a smaller memory footprint, it can also lead to memory contention among threads. Slower memory accesses may result, unless threads always access separate cache lines in memory. However, multiple threads and/or processes are required for achieving full memory bandwidth.

1Historical note: when the SKX nodes were part of the former Stampede2, the observed speedup was only about 27x on 48 cores. Why? The reason has to do with Intel's Turbo Boost technology, which permits the clock speed to vary based upon the vector instruction set used, the number of active cores, and other factors affecting power requirements and temperature limits. (See the page Guidance on Instruction Sets for additional information on this, as well as some companion material on Turbo Boost.) On the former Stampede2, the SKX clock would run much faster when fewer cores were in use. Thus, the SKX clock might run at 3.5 GHz with 1 core running AVX-512, but slow down to 2.0 GHz with all 48 cores in use. As a result, even though one would expect a parallel speedup of 48x on 48 cores, one would instead see 48*(2.0/3.5) = 27.4 ≈ 27. Now, however, it appears that TACC's CLX and SKX nodes are configured to run at a lower, "all-core boost" frequency regardless of how many cores are in use. This prevents Turbo Boost from interfering with the expected scaling behavior. Still, on some platforms, one should be aware that variation of clock speed based on context (e.g., number of cores in use) will affect the apparent scaling.

 
©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement