In our introduction to Threads, Blocks, and Grids, we briefly hinted that blocks and grids are, in fact, multidimensional. Each thread has access to two built-in variables: threadIdx and blockIdx, which represent the index of the thread within its block and the index of the block within the grid, respectively. In the previous exercises, we created scenarios where organizing threads in one dimension was sufficient, and we used the variables blockIdx.x and threadIdx.x extensively to access the x-components of the block and thread indices. For more complex use cases, such as simulations or tasks involving multidimensional data, multidimensional blocks and grids may be a more suitable approach.

Revisiting: Threads, Blocks, and Grids

In any CUDA program, the host code specifies an execution configuration for the kernel function, which defines the grid on the device. In the previous exercises, the kernel invocation was accompanied by two int arguments, blocksPerGrid and threadsPerBlock, specifying the number of blocks per grid and the number of threads per block, respectively. But these two arguments are not necessarily integers; more generally, in the case of multidimensional grids or blocks, they are of type dim3.
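For instance, a 1D launch of the kind used in the previous exercises might look like the following sketch, where the kernel vecAdd, the element count N, and the device pointers are hypothetical:

    // Hypothetical 1D launch: both configuration arguments are plain ints
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover N elements
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);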

Initialization of dim3 variables uses the C++ constructor style, as follows: dim3 blocksPerGrid(R, S, T) and dim3 threadsPerBlock(U, V, W). (This C++ syntax works even in C, and since CUDA's dim3 type is really a struct, C-style initialization with curly braces is also possible.) Given integers R, S, T, U, V, and W, the declarations create dim3 type variables blocksPerGrid and threadsPerBlock. When these variables are passed as arguments to the kernel call func<<<blocksPerGrid, threadsPerBlock>>>(), the kernel will be executed on a grid having R × S × T blocks in its x × y × z dimensions, with each block having U × V × W threads in its x × y × z dimensions. It is also possible to mix variables of type int and dim3 in the same kernel call, as in the following example:
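    // U, V, and W are ints as above; func is a kernel defined elsewhere
    dim3 threadsPerBlock(U, V, W);
    func<<<3, threadsPerBlock>>>();  // an int for the grid, a dim3 for the block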

The grid in this case will be interpreted as consisting of 3 × 1 × 1 blocks, and each block will have U × V × W threads.

The dimensions of grids and blocks cannot be arbitrarily large. The limits are defined by the Compute Capability (CC) of the device: the maximum dimensions of a grid are generally (2³¹ − 1) × 65535 × 65535, and the maximum dimensions of a block are 1024 × 1024 × 64. The latter limits cannot all be reached at once, because the total number of threads per block is capped at 1024; a typical choice might be 32 × 32 × 1. These limits on grid and block dimensions have not changed since CC 3.0 and CC 2.0, respectively, and they still hold for today's NVIDIA devices (through CC 12.0).
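Rather than hard-coding these limits, a program can query them at run time. Here is a minimal sketch using the CUDA runtime's cudaGetDeviceProperties; the fields maxGridSize, maxThreadsDim, and maxThreadsPerBlock are members of the cudaDeviceProp struct:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0
        printf("Max grid dimensions:   %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("Max block dimensions:  %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }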


On the device side, each thread has access to the built-in variables blockIdx and threadIdx and their components x, y, and z, which give the block index and thread index along each dimension. Two more built-in variables, blockDim and gridDim, tell the threads the dimensions of their blocks and of the grid, respectively, in the current kernel. If the grid or block is one-dimensional, the y and z components of gridDim or blockDim are set to 1 (and the corresponding components of blockIdx and threadIdx are 0).

In the example above, the func kernel will be configured with gridDim.x = 3, blockDim.x = U, blockDim.y = V, and blockDim.z = W. These variables are available to the threads so they can compute their i, j, k indices in a global 3D index space, as shown below.
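A minimal sketch of that computation inside func (the body shown here is illustrative; the index formula is the standard CUDA idiom):

    __global__ void func()
    {
        // Global 3D indices: the block's offset times the block extent,
        // plus the thread's local index within the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // blockIdx.y is always 0 here, since gridDim.y = 1
        int k = blockIdx.z * blockDim.z + threadIdx.z;  // likewise blockIdx.z is always 0
        // ... operate on element (i, j, k) ...
    }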

Grid-Stride Loops

The configuration above is suitable when the size of the input data is known in advance, so the programmer can define a grid large enough to cover it. However, if the size of the data is unknown at compile time, or the number of elements exceeds the maximum number of threads a grid can hold (so that computing one element per thread is infeasible), a fixed one-thread-per-element configuration will fail. To make a kernel robust to the size of its input, consider implementing a grid-stride loop.

In a grid-stride loop, each thread operates on multiple elements of the input data, with consecutive elements handled by the same thread separated by the total number of threads in the grid. For example, in 1D, a thread would handle every T-th element of the input, starting at an offset of i, where T is the total extent of the grid along x and i is the thread's sequential x-index among all threads in the grid. This way, each thread handles a unique set of elements, all data are processed, and the code is agnostic to the size of the input.
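As a sketch, a hypothetical 1D kernel implementing a grid-stride loop might look like this (the names scale, data, n, and factor are illustrative):

    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's sequential x-index (the offset)
        int stride = gridDim.x * blockDim.x;            // total number of threads in the grid (T)
        // Each thread starts at its own index and advances by T, so together
        // the threads cover all n elements regardless of the launch configuration
        for ( ; i < n; i += stride) {
            data[i] *= factor;
        }
    }

Because the loop bound is n rather than the grid size, the same kernel works whether the grid contains more or fewer threads than there are data elements.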

 