To get a handle on how the code performs on a given architecture, Shiquan used gprof as a quick analysis of where the bottlenecks might lie. But I prefer to use the Intel tools. Their most recent suite, the 2018 tools (I forget what they call the suite), includes VTune, and those tools are really nice, comprehensive, and easy to read.

I ran Intel's VTune Amplifier, and if you want to do this yourself, I ran it deployed onto a compute node. You take the command line for your executable and put amplxe-cl in front of it. You can also run this in interactive GUI mode, but to deploy it in batch, just insert amplxe-cl, and in this case I wanted to collect the hotspots.

You can see the result in the table below. We have this function stack, and this loop at line 31 in main is the loop that controls the time stepping. Within this loop, most of the time is spent in the Legendre transform function tra_qst2rtp, which took up 19% of the run time within the time-stepping loop. The next two most costly functions, vel_transform and mag_transform, actually call tra_qst2rtp. From here on I will just call this function the "bottleneck function" to save myself some tongue-tripping. So any improvements made in the bottleneck function are going to carry over to the next two most expensive functions as well.

Now that we know which function or functions are taking up most of the time, we can take a closer look using Intel Advisor. It is very similar to VTune: just insert advixe-cl -collect survey. The result is a more detailed breakdown of the loops within our bottleneck function. I can see here that the first three lines are the most time-consuming loops, and they occur at lines 335, 325, and 316.
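As a rough sketch of the two batch collections described above, the snippet below assembles the command lines by putting the profiler in front of an application command. The executable name, its argument, and the result and project directory names are hypothetical placeholders, not the ones used in the talk; the snippet only prints the commands and does not run the profilers.

```python
import shlex


def profiled_command(tool_args, app_command):
    """Prefix an application's command line with a profiler invocation."""
    # "--" separates the profiler's own options from the application command.
    return list(tool_args) + ["--"] + list(app_command)


# Hypothetical executable and argument; substitute your own command line.
app = ["./my_simulation", "input.nml"]

# VTune Amplifier hotspots collection (result directory name is an assumption).
vtune = profiled_command(
    ["amplxe-cl", "-collect", "hotspots", "-result-dir", "vtune_hotspots"], app
)

# Advisor survey collection (project directory name is an assumption).
advisor = profiled_command(
    ["advixe-cl", "-collect", "survey", "-project-dir", "advisor_survey"], app
)

# Print the command lines as they would appear in a batch script.
for cmd in (vtune, advisor):
    print(shlex.join(cmd))
```

In a batch job, the printed lines would take the place of the plain executable line in the job script, so the collection runs on the compute node.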
The bottom three loops in the survey are located in the second most costly function, but we'll just focus on our identified bottleneck function. The survey also tells us each loop's contribution to run time and its vectorization efficiency, which we'll talk about later. It gives a measure of gigaflops and arithmetic intensity, and it can also give us some clues to performance issues.
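For readers less familiar with the two metrics just mentioned, here is a minimal sketch of how they are usually defined: gigaflops as floating-point operations per second divided by 1e9, and arithmetic intensity as floating-point operations per byte of memory traffic. The loop, its operation counts, and the elapsed time are made-up illustrative values, not measurements from the code discussed in the talk.

```python
def gflops(flop_count, seconds):
    """Floating-point operations per second, in units of 1e9."""
    return flop_count / seconds / 1e9


def arithmetic_intensity(flop_count, bytes_moved):
    """Floating-point operations performed per byte of memory traffic."""
    return flop_count / bytes_moved


# Hypothetical loop: each iteration does one multiply and one add (2 FLOPs)
# and moves three 8-byte doubles (two loads, one store).
iterations = 100_000_000
flops = 2 * iterations
bytes_moved = 3 * 8 * iterations
elapsed_seconds = 0.5  # assumed timing, for illustration only

print(f"GFLOPS: {gflops(flops, elapsed_seconds):.3f}")
print(f"Arithmetic intensity: {arithmetic_intensity(flops, bytes_moved):.4f} FLOP/byte")
```

A low arithmetic intensity like the one in this made-up loop suggests the loop is limited by memory traffic rather than by arithmetic, which is one kind of clue the survey can point to.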