Below is a complete CUDA program that multiplies a pair of matrices using two different kernels: a naive implementation, and a tiled implementation that works through a succession of submatrices (tiles) held temporarily in shared memory. The program runs each kernel 10 times, then compares their results and their speed.

  1. Examine the code carefully to understand how the two kernels work.
  2. Download the code or copy and paste its contents into a file.
  3. Compile and run the code on a GPU node; instructions for TACC were provided previously.
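
As a companion to the steps above, here is a minimal sketch of what the two kernels and the timing loop might look like. It is not the workshop's own code: the kernel names (`matmulNaive`, `matmulTiled`), the tile width `TILE = 16`, the matrix size `n = 512`, and the input values are all assumptions chosen for illustration, and error checking on the CUDA calls is left out for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; the workshop code may use a different value

// Naive kernel: each thread computes one element of C, reading directly from global memory.
__global__ void matmulNaive(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Tiled kernel: each block stages one TILE x TILE submatrix of A and of B in
// shared memory, accumulates the partial products, then moves to the next pair of tiles.
__global__ void matmulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Pad with zeros where the tile extends past the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // wait until both tiles are fully loaded
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten
    }
    if (row < n && col < n)
        C[row * n + col] = sum;
}

int main()
{
    const int n = 512, runs = 10;  // assumed problem size; 10 runs as in the exercise
    const size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> hA(n * n), hB(n * n), hNaive(n * n), hTiled(n * n);
    for (int i = 0; i < n * n; ++i) {  // arbitrary test data
        hA[i] = (i % 7) * 0.5f;
        hB[i] = (i % 5) * 0.25f;
    }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msNaive = 0.0f, msTiled = 0.0f;

    // Time 10 launches of the naive kernel.
    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r)
        matmulNaive<<<grid, block>>>(dA, dB, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msNaive, start, stop);
    cudaMemcpy(hNaive.data(), dC, bytes, cudaMemcpyDeviceToHost);

    // Time 10 launches of the tiled kernel.
    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r)
        matmulTiled<<<grid, block>>>(dA, dB, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msTiled, start, stop);
    cudaMemcpy(hTiled.data(), dC, bytes, cudaMemcpyDeviceToHost);

    // Compare the two results element by element.
    float maxDiff = 0.0f;
    for (int i = 0; i < n * n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(hNaive[i] - hTiled[i]));

    printf("naive: %.3f ms per run\n", msNaive / runs);
    printf("tiled: %.3f ms per run\n", msTiled / runs);
    printf("max abs difference: %g\n", maxDiff);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

If this sketch were saved as `matmul_sketch.cu` (a hypothetical filename), it could be built with `nvcc -o matmul_sketch matmul_sketch.cu`. The tiled kernel reads each element of A and B from global memory once per tile rather than once per multiply, which is where any speedup would come from; the actual ratio depends on the GPU and the matrix size.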
 
© Cornell University | Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)