Our hotspot and vector analyses tell us how the code behaves on a single MPI rank, but this code is of course intended to be distributed across many ranks. So to get an idea of how this application will behave in practice, we need to do some scaling studies. The following three plots illustrate the behavior of this code on Stampede2. These are strong scaling studies: we keep the total problem size fixed while varying the number of ranks, and we distribute those ranks among nodes and cores in a variable manner. The problem we've chosen is a fixed spherical grid with a discretization of 160 gridlines along the radial dimension, 80 along theta, and 80 along phi. This was presented as an exemplar of the target cases the PIs are ultimately going to be running.

Our first plot compares runs on a single KNL node and on a single Sandy Bridge Xeon node. What we see is a speedup plot, with everything normalized to the performance of a single KNL core: each point is the time on a single rank divided by the time at the given number of ranks. In both cases the code scales fairly linearly until we reach the number of physical cores on the machine. It's interesting to note that the KNL meets, or nearly meets, the Sandy Bridge performance at 16 ranks. This is somewhat surprising at first, because the Sandy Bridge Xeon cores are almost twice as fast as the KNL cores; however, the code is vectorized, and the KNL has twice the vector length of Sandy Bridge. So it's not too surprising that we get nearly the same performance at the same number of ranks. And of course the KNL has 64 cores, so we can push scalability considerably past what the Sandy Bridge can do. As I mentioned, the scaling plateaus beyond the number of physical cores; this is pure MPI, so we have one rank per core up to the total number of available cores, and beyond that we're placing more than one MPI rank per core.

Of course, we also want to see how this scales across nodes as well as within a node. In the second plot, each curve corresponds to a constant number of nodes, so the single-node curve is the same as the KNL curve from the first plot. What we see is that the best performance, 0.223 seconds per iteration, comes at 128 total ranks, using 4 nodes with 32 ranks per node. That's somewhat surprising, considering that our single-node peak came at 64 ranks: we do slightly better by decreasing the number of ranks per node. For internode performance, we see the scaling efficiency degrade once we move beyond 8 nodes.

The third plot shows the same data grouped by a constant number of processes per node: rather than each curve being all single-node or all two-node runs, each curve holds the processes per node fixed. For example, the blue-diamond curve was run with 32 processes per node, so its points correspond to one node, two nodes, four nodes, and so on. Again we see that the optimal performance is found at 128 total ranks, but we also do quite well at either 4 or 8 nodes using 32 or 16 processes per node. What this shows us is that we get optimum performance while underutilizing the cores available on each node. Possible reasons for this are that, as we scale up, we suffer from MPI communication overhead across nodes, or that we suffer from core or thread resource contention within a node.
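As a concrete illustration of how the speedup curves above are computed, here is a minimal sketch; the timing values are hypothetical placeholders for illustration only, not the measured Stampede2 data.

```python
# Sketch of the speedup/efficiency calculation behind the scaling plots.
# The timings below are hypothetical placeholders, NOT measured data;
# in practice t[n] would be the wall-clock time per iteration at n MPI ranks.
t = {
    1: 14.0,   # seconds per iteration on a single rank (placeholder)
    16: 1.0,
    64: 0.30,
    128: 0.25,
}

t1 = t[1]
for n, tn in sorted(t.items()):
    speedup = t1 / tn          # S(n) = t(1) / t(n)
    efficiency = speedup / n   # E(n) = S(n) / n
    print(f"{n:4d} ranks: speedup = {speedup:6.2f}, efficiency = {efficiency:5.2f}")
```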
So we should figure out which one of those is happening.
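One way to start separating the two effects would be to time the communication and computation phases of each iteration independently on every rank. The sketch below is illustrative only, using mpi4py; `compute_step` and `halo_exchange` are hypothetical stand-ins, not the application's actual routines.

```python
# Sketch (not the application's code): time MPI communication and local
# computation separately. If compute time stops shrinking as ranks are added
# within a node, that suggests core/thread contention; if communication time
# grows as nodes are added, that suggests MPI communication overhead.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def compute_step():
    # placeholder for the local grid update on this rank
    pass

def halo_exchange():
    # placeholder for the boundary exchange with neighboring ranks
    pass

t_comp = 0.0
t_comm = 0.0
for _ in range(100):          # a fixed number of iterations
    t0 = MPI.Wtime()
    compute_step()
    t1 = MPI.Wtime()
    halo_exchange()
    t2 = MPI.Wtime()
    t_comp += t1 - t0
    t_comm += t2 - t1

# The slowest rank determines the per-iteration time.
max_comp = comm.allreduce(t_comp, op=MPI.MAX)
max_comm = comm.allreduce(t_comm, op=MPI.MAX)
if rank == 0:
    print(f"max compute time: {max_comp:.3f} s, max comm time: {max_comm:.3f} s")
```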