Exercise: Thread Mapping
This short exercise demonstrates how to invoke a kernel with a two-dimensional grid and two-dimensional blocks to compute the transpose of a matrix. You can compile and run your code at any point during its development to check your progress. If you are using TACC resources, please refer to the earlier page with instructions for Slurm.
As explained in the previous topic, grids and blocks can have higher dimensions, which makes certain computations more convenient to express. We will modify an existing CPU program that transposes a square matrix to make use of these features. The original C program is, in outline, the following:
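The sketch below assumes a 1024 × 1024 matrix of floats allocated as a pointer to an array of row pointers; the exact size, element type, and variable names are placeholders rather than details from the original listing.

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define N 1024   /* assumed matrix dimension */

/* Transpose a square matrix stored as an array of row pointers. */
void transpose(float **in, float **out, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            out[j][i] = in[i][j];
}

int main(void)
{
    /* Each matrix is a pointer to an array of N row pointers. */
    float **a = malloc(N * sizeof(float *));
    float **t = malloc(N * sizeof(float *));
    for (int i = 0; i < N; i++) {
        a[i] = malloc(N * sizeof(float));
        t[i] = malloc(N * sizeof(float));
        for (int j = 0; j < N; j++)
            a[i][j] = (float)(i * N + j);
    }

    transpose(a, t, N);

    /* Verify the result. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(t[j][i] == a[i][j]);

    for (int i = 0; i < N; i++) {
        free(a[i]);
        free(t[i]);
    }
    free(a);
    free(t);
    printf("Transpose verified.\n");
    return 0;
}
```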
If we want to modify this program to be run on a GPU, there are several issues that need to be addressed.
The first issue is the matrix allocation. In CUDA, it is possible to allocate a matrix as you would in C, using a pointer to an array of row pointers, as was done above. However, allocating and referencing a matrix this way is inefficient and generally not recommended. For example, to fetch element [0,0], the device must access global memory (or the L1/L2 cache) twice: once for the pointer to the first row, and once more for the 0th element of that row.
Another issue is that data fetching is most efficient when the data is contiguous in memory. When the GPU accesses memory, it fetches a fixed number of bytes at a time. If the data is not contiguous, multiple fetches are necessary; depending on the size of the matrix, this can be a significant performance bottleneck. Hence, data in CUDA applications is almost always stored in a flattened array.
The issue is not too severe in this case, given that each row is already stored in contiguous memory. But it is still desirable to avoid the double-pointer lookup, so we will convert the code to store each matrix in a one-dimensional array.
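A sketch of the flattened CPU version, under the same assumptions as before, might look like this; element [i,j] of an n × n matrix is stored at index i*n + j:

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define N 1024   /* assumed matrix dimension */

/* Transpose a square matrix stored in a flattened 1D array;
 * element [i,j] lives at index i*n + j. */
void transpose(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

int main(void)
{
    /* One allocation per matrix; rows are laid out back to back. */
    float *a = malloc((size_t)N * N * sizeof(float));
    float *t = malloc((size_t)N * N * sizeof(float));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] = (float)(i * N + j);

    transpose(a, t, N);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(t[j * N + i] == a[i * N + j]);

    free(a);
    free(t);
    printf("Transpose verified.\n");
    return 0;
}
```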
Notice that in both the original transpose function and in this modified CPU version, each matrix row is contiguous in memory, so fetching rows of the original matrix will be efficient. However, columns are stored with a large stride in memory, so storing columns into the transposed matrix will necessarily be inefficient. This type of inefficiency is unavoidable in any simple implementation of a matrix transpose operation. Still, the modified version does gain some benefit from dereferencing just one pointer instead of two.
Another point of consideration for the GPU code is the sizes of grids and blocks. These concepts are mainly used for the programmer's convenience, and there are multiple grid and block configurations for any task, with varying levels of efficiency. We will first use a naive, single-block configuration of 1024 × 1024 threads, with each thread being responsible for copying one element of the matrix. (We can imagine orienting the original matrix so that, in the kernel function, the row index i runs along the x-axis and the column index j runs along the y-axis.)
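A sketch of this naive first attempt, assuming unified (managed) memory and a kernel called transpose_naive, might look like the following:

```cuda
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <cuda_runtime.h>

#define N 1024   /* assumed matrix dimension */

/* Each thread copies one element: the row index i runs along x,
 * the column index j along y. */
__global__ void transpose_naive(const float *in, float *out, int n)
{
    int i = threadIdx.x;   /* row index */
    int j = threadIdx.y;   /* column index */
    out[j * n + i] = in[i * n + j];
}

int main(void)
{
    float *a, *t;
    size_t bytes = (size_t)N * N * sizeof(float);
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&t, bytes);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] = (float)(i * N + j);
    memset(t, 0, bytes);   /* start from a zeroed output matrix */

    /* A single block of 1024 x 1024 threads -- far over the 1024-thread limit. */
    dim3 block(N, N);
    dim3 grid(1, 1);
    transpose_naive<<<grid, block>>>(a, t, N);
    cudaDeviceSynchronize();

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(t[j * N + i] == a[i * N + j]);   /* fails: the kernel never ran */

    cudaFree(a);
    cudaFree(t);
    return 0;
}
```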
This code will compile and run, but you will notice that it does not pass the assert statements. If you investigate further, you will see that the transposed matrix is all 0s. This is a sign that an error occurred in the transpose function or on the device. The reason is that (as previously mentioned) the maximum number of threads per block is 1024, whereas the block configuration in this case is 1024 × 1024, exceeding the limit by a factor of 1024. The device is unable to run the kernel, and the matrix is never transposed.
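Such failures are silent unless you check for them. One way to surface the problem is to query the error status right after the launch, for example by replacing the launch lines in the sketch above with something like:

```cuda
transpose_naive<<<grid, block>>>(a, t, N);
cudaError_t err = cudaGetLastError();   /* reports launch-configuration errors */
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();
```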
Clearly, we need to define more blocks that are smaller in size. What other grid or block configuration could be a solution? Since the maximum number of threads per block is 1024, we can let each block transpose a 32 × 32 section of the matrix and configure the grid to be 32 × 32 blocks so that the entire matrix can be transposed.
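A sketch of the kernel and launch for this configuration, replacing those parts of the earlier sketch (the kernel name is again an assumption), might be:

```cuda
/* Each thread still handles one element, but its global indices now
 * combine block and thread coordinates. */
__global__ void transpose_tiled(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* row index */
    int j = blockIdx.y * blockDim.y + threadIdx.y;   /* column index */
    if (i < n && j < n)
        out[j * n + i] = in[i * n + j];
}

/* Launch with 32 x 32 threads per block and a 32 x 32 grid of blocks,
 * which together cover the full 1024 x 1024 matrix. */
dim3 block(32, 32);
dim3 grid(N / 32, N / 32);
transpose_tiled<<<grid, block>>>(a, t, N);
cudaDeviceSynchronize();
```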
Here are a few questions to ponder, and some challenges to try:
- How would this be modified such that it works for non-square matrices?
- Would other grid/block configurations run any faster? You can try a few.
- The final version of the code only works for matrices up to size 1024 × 1024. Modify the code so that it handles matrices of any reasonable size.
- Modify the code to implement a grid-stride loop, so that each thread handles more than one element (see the sketch after this list). With that change, the program should run correctly regardless of the grid and block configuration or the matrix size.
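For the last challenge, a two-dimensional grid-stride loop might be sketched as follows; the fixed 8 × 8 grid in the launch is an arbitrary choice, meant to illustrate that the configuration no longer needs to match the matrix size:

```cuda
/* Two-dimensional grid-stride loop: each thread strides across the matrix
 * by the total thread count in each direction, so any launch configuration
 * covers a matrix of any size n. */
__global__ void transpose_stride(const float *in, float *out, int n)
{
    int strideX = gridDim.x * blockDim.x;
    int strideY = gridDim.y * blockDim.y;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += strideX)
        for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < n; j += strideY)
            out[j * n + i] = in[i * n + j];
}

/* Example launch: a modest fixed grid; the loops pick up the remaining work. */
dim3 block(32, 32);
dim3 grid(8, 8);
transpose_stride<<<grid, block>>>(a, t, n);
cudaDeviceSynchronize();
```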
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)