Roofline analysis indicates that the primary bottleneck loop still runs well below the double-precision FMA peak and remains limited by cache bandwidth. At this point, any further significant performance gains will probably require algorithmic restructuring of the code.
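
As a rough illustration of what "limited by cache bandwidth" means in a roofline model, the sketch below evaluates the usual bound: attainable performance is the smaller of the compute peak and the arithmetic intensity times the bandwidth. The function name `attainable_gflops` and the machine numbers are hypothetical placeholders chosen only for this example. They are not measurements from this analysis.

```python
def attainable_gflops(ai_flops_per_byte, peak_gflops, bandwidth_gb_per_s):
    """Roofline bound: min(compute peak, arithmetic intensity * bandwidth)."""
    return min(peak_gflops, ai_flops_per_byte * bandwidth_gb_per_s)


# Hypothetical machine parameters, for illustration only.
PEAK_GFLOPS = 50.0        # assumed double-precision FMA peak
CACHE_BW_GB_PER_S = 100.0  # assumed bandwidth of one cache level

for ai in (0.125, 0.25, 0.5, 1.0):
    bound = attainable_gflops(ai, PEAK_GFLOPS, CACHE_BW_GB_PER_S)
    # Below the ridge point the bandwidth roof applies; at or above it, the compute roof.
    limit = "bandwidth" if bound < PEAK_GFLOPS else "compute"
    print(f"AI = {ai:5.3f} flop/byte -> at most {bound:5.1f} GFLOP/s ({limit}-bound)")
```

With these assumed numbers, a loop whose arithmetic intensity is below 0.5 flop/byte sits under the bandwidth roof, so it can only move toward the compute peak if its arithmetic intensity rises.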

Key points:
  • To approach the rooflines associated with optimal performance, the arithmetic intensity of the key loops must increase. By definition, this means reducing the number of memory accesses per floating-point operation.
  • One possible algorithmic approach to increasing the arithmetic intensity would be to restructure the matrix-matrix multiplication at the core of the Legendre transform. At present it is implemented as a series of vector-matrix multiplies, which have a constant arithmetic intensity of AI = 2. Reimplementing the multiplication as a batch of smaller matrix-matrix multiplies would instead yield an AI that grows with the matrix size n.
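
The contrast between the two formulations can be sketched with a simple operation count. This minimal example assumes square n × n operands, counts arithmetic intensity as floating-point operations per array element moved (not per byte), and assumes an idealized cache in which each operand element is moved exactly once. The function names are hypothetical and the counts are a model, not measurements of the actual Legendre transform code.

```python
def ai_vector_matrix(n):
    """Idealized AI of one vector-matrix multiply with an n x n matrix."""
    flops = 2 * n * n          # one multiply and one add per matrix element
    elements = n * n + 2 * n   # read the matrix and input vector, write the output vector
    return flops / elements


def ai_matrix_matrix(n):
    """Idealized AI of one n x n by n x n matrix-matrix multiply."""
    flops = 2 * n ** 3         # n multiply-add pairs for each of the n*n outputs
    elements = 3 * n * n       # read both input matrices, write the result once
    return flops / elements


for n in (8, 64, 512):
    print(f"n = {n:4d}: vector-matrix AI = {ai_vector_matrix(n):5.2f}, "
          f"matrix-matrix AI = {ai_matrix_matrix(n):7.2f}")
```

Under these assumptions the vector-matrix count approaches 2 from below as n grows, because each matrix element is used for only one multiply-add, while the matrix-matrix count is 2n/3, because each element is reused n times.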
 
© Cornell University | Center for Advanced Computing