If we look at the roofline plot again, the ceiling for our bottleneck loop is the double-precision FMA peak, which is way up here at 40 GFLOPS, and we are far from it. In fact, according to this roofline plot, even if we somehow miraculously increased performance to that degree, we would then be bound by L1 bandwidth. So if we want to reach optimal speedup here, we will have to increase the arithmetic intensity, and one way to do that is to reduce the number of memory accesses per floating-point operation. For future potential improvement, note that the parallelization scheme is well optimized: it splits up the matrix-matrix multiplication of the Legendre transform and evenly distributes it as vector-matrix multiplications, which have an arithmetic intensity of two. However, if we can restructure the algorithm from the top down to bundle these operations into batched matrix-matrix operations, we could dramatically improve the arithmetic intensity to n/2, and I think only with this kind of transformation will we be able to reach the peak performance of the KNL.
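A minimal sketch of the flop/traffic bookkeeping behind those two intensity figures, counting matrix and vector elements moved rather than bytes; the function names are illustrative, and the counts are the usual leading-order estimates (each operand moved once), not measurements:

```python
def matvec_intensity(n):
    # y = A @ x on an n x n matrix: 2*n*n flops (one multiply and one
    # add per matrix element); traffic is dominated by streaming A once,
    # plus the x and y vectors.
    flops = 2 * n * n
    elements_moved = n * n + 2 * n
    return flops / elements_moved  # -> 2 for large n

def matmul_intensity(n):
    # C = A @ B with n x n matrices: 2*n**3 flops; if A and B are each
    # read once and C is read and written once, 4*n*n elements move.
    flops = 2 * n ** 3
    elements_moved = 4 * n * n
    return flops / elements_moved  # -> n/2

if __name__ == "__main__":
    for n in (64, 256, 1024):
        print(n, matvec_intensity(n), matmul_intensity(n))
```

The point is that the matrix-vector intensity is capped at two no matter how large n gets, while the batched matrix-matrix form scales with n, which is what lets it climb the roofline toward the FMA peak.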