2D and 3D Blocks and Grids
In our introduction to Threads, Blocks, and Grids, we briefly hinted that blocks and grids are, in fact, multidimensional. Each thread has access to two built-in variables, blockIdx and threadIdx, which represent the index of the block within the grid and the index of the thread within the block, respectively. In the previous exercises, we created scenarios where organizing threads in one dimension was sufficient, so we made extensive use of the variables blockIdx.x and threadIdx.x to access the x-components of the block and thread indices. For more complex use cases, such as simulations or tasks involving multidimensional data, multidimensional blocks and grids may be a more suitable approach.
Revisiting: Threads, Blocks, and Grids
In any CUDA program, the host code initializes an execution configuration for the kernel function, which defines the grid on the device. In the previous exercises, the kernel invocation was accompanied by two int arguments, blocksPerGrid and threadsPerBlock, that specify the number of blocks per grid and the number of threads per block, respectively. But these two arguments are not necessarily integers; more generally, in the case of multidimensional grids or blocks, they are of type dim3.
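In fact, when plain ints are used, they are implicitly converted to dim3 values whose y and z components are 1. A sketch of such a 1D launch (the kernel name func and the particular counts are illustrative):

    int blocksPerGrid = 4;        // 4 blocks in the grid
    int threadsPerBlock = 256;    // 256 threads in each block
    func<<<blocksPerGrid, threadsPerBlock>>>();  // ints are promoted to dim3: (4, 1, 1) and (256, 1, 1)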
Initialization of dim3 variables uses the C++ constructor style, as follows: dim3 blocksPerGrid(R, S, T) and dim3 threadsPerBlock(U, V, W). (This C++ syntax works even in C, and since CUDA's dim3 type is really a struct, C-style initialization with curly braces is also possible.) Given integers R, S, T, U, V, and W, these declarations create dim3-type variables blocksPerGrid and threadsPerBlock. When these variables are passed as arguments to the kernel call func<<<blocksPerGrid, threadsPerBlock>>>(), the kernel will be executed on a grid having R × S × T blocks in its x × y × z dimensions, with each block having U × V × W threads in its x × y × z dimensions. It is also possible to mix variables of type int and dim3 in the same kernel call, as in the following example (a sketch, assuming U, V, and W are previously defined ints):
The grid in this case will be interpreted as consisting of 3 × 1 × 1 blocks, and each block will have U × V × W threads.
The dimensions of grids and blocks cannot be arbitrarily large. The limits are defined by the Compute Capability (CC) of the device, with the maximum dimensions of a grid generally being (2³¹ − 1) × 65535 × 65535 and the maximum dimensions of a block being 1024 × 1024 × 64. The latter limits cannot all be reached simultaneously, as the total number of threads per block is capped at 1024; a typical choice might be 32 × 32 × 1. The above limits on grid and block dimensions have not changed since CC 3.0 and 2.0, respectively, and they still hold for today's NVIDIA devices (through CC 12.0).
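These limits can also be queried at run time through the CUDA runtime API; here is a minimal sketch (querying device 0 is an assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0
        printf("Max grid dimensions:  %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("Max block dimensions: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }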
On the device side, each thread has access to the built-in variables blockIdx and threadIdx and their components x, y, and z, which represent the thread index or block index along a specific dimension. Two more built-in variables, blockDim and gridDim, indicate to the threads the dimensions of their blocks and grids, respectively, in the current kernel. If the grid or block is one-dimensional, then the y and z components are set to 1.
In this example, the func kernel will be configured with gridDim.x = 3, blockDim.x = U, blockDim.y = V, and blockDim.z = W. These variables are available to the threads so they can compute their i, j, k indices in a global 3D index space, as shown below.
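A sketch of that index computation inside func (the empty body is illustrative; only the index arithmetic is the point here):

    __global__ void func()
    {
        // Global indices of this thread in the grid's 3D index space
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // blockIdx.y = 0 here, since gridDim.y = 1
        int k = blockIdx.z * blockDim.z + threadIdx.z;  // blockIdx.z = 0 here, since gridDim.z = 1
        // ... use (i, j, k) to address a multidimensional data set ...
    }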
Grid-Stride Loops
The configuration above is suitable if the size of the input data is known and the programmer can define the size of a grid to compute the data. However, if the size of the data is unknown at compile time, or the number of elements exceeds the maximum number of threads (so that computing one element per thread is infeasible), then a configuration that assigns one element per thread will fail. To make a kernel robust against such failures, consider implementing a grid-stride loop.
In a grid-stride loop, each thread operates on multiple elements of the input data, where the elements assigned to a given thread are separated by the total number of threads in the grid. For example, in 1D, a thread would handle every T-th element of the input data, with an offset of i, where T is the total extent of the grid along x, and i is the sequential x-index of the thread among all of the threads. This way, each thread will handle a unique set of data, and all data will be processed, making the code agnostic to the size of the input.
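A minimal sketch of a 1D grid-stride kernel, assuming a hypothetical element-wise operation on an array x of n floats:

    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's sequential x-index
        int T = gridDim.x * blockDim.x;                 // total number of threads in the grid
        for (; i < n; i += T)                           // stride through the data
            x[i] *= 2.0f;                               // illustrative per-element work
    }

Because the loop bound depends on n rather than on the launch configuration, the same grid works for any input size, and every element is processed exactly once.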
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)