Thread Divergence
Recall that threads from a block are bundled into fixed-size warps for execution on CUDA cores, and that all the threads in a warp must execute the same instruction at the same time. To put it another way, the threads within a warp must follow the same execution trajectory. How, then, can a warp of threads handle a conditional, such as an if statement, in CUDA?
if-else
Let's look in detail at what happens when CUDA is presented with an if-else construct. Assuming some threads in a single warp evaluate the if condition as 'true' and others as 'false', the 'true' and 'false' threads will branch to different instructions. Some threads will want to proceed to the if block, while others will want to proceed to the else block. In CUDA, this condition is called thread divergence. Branching is perhaps the most common cause of thread divergence.
Intuitively, one would think that the statements in the if and else blocks should be executed in parallel. However, because the threads in a warp are required to execute a single stream of instructions, CUDA can only "fix" this type of thread divergence with a workaround. When executing an if-else construct, CUDA instructs the warp to execute the if part first and then proceed to the else part. While executing the if block, all the threads that evaluated the condition to be false (i.e., the "else" threads) are instructed to do nothing; they are effectively deactivated. When execution proceeds to the else block, the situation is reversed.
As you can see, the if and else parts are not executed in parallel at all, but rather one after the other, with some fraction of the threads remaining idle in each part. This means that thread divergence can result in a significant performance loss. The following pseudocode illustrates the point.
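A kernel along these lines is assumed here (the kernel name and the work done in each branch are only placeholders). Because even- and odd-indexed threads share a warp, every warp has to run both branches, one after the other:

    __global__ void divergent_branch(float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (threadIdx.x % 2 == 0) {
            out[i] = 2.0f * i;    // even-indexed threads take this path first...
        } else {
            out[i] = 3.0f * i;    // ...then odd-indexed threads take this one,
        }                         // so each warp runs the two branches serially
    }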
On the other hand, the next piece of pseudocode is not considered a case of thread divergence, because not all warps would face a divergence. In fact, there might be only one affected warp, or no affected warp if N is a multiple of 32.
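A typical example of such a benign branch is the usual bounds check on the global thread index; the sketch below uses an illustrative kernel name:

    __global__ void bounds_checked(float *out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < N) {              // all threads of a full warp agree on this test;
            out[i] = 2.0f * i;    // at most the one partial warp at the end diverges
        }
    }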
For-Loop
Similar to if-else, a for-loop can also cause thread divergence. If the loop parameters are not exactly the same for all threads in a warp, then the threads will not be able to execute the loop completely in parallel. Suppose in a warp of threads, certain threads will execute the loop five times, and other threads will execute the loop ten times. The first five iterations will be executed in parallel, as expected. From the sixth iteration onwards, the threads that have finished their loops will stay idle, while the other threads will continue with their final five iterations.
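A sketch of such a loop, with a per-thread iteration count read from an array (the names niter and variable_trip_count are only placeholders), might look like this:

    __global__ void variable_trip_count(const int *niter, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;

        // If niter[] holds 5 for some threads of a warp and 10 for others,
        // the warp as a whole steps through 10 iterations, and the "5" threads
        // sit idle for the last 5 of them.
        for (int k = 0; k < niter[i]; ++k) {
            sum += 0.5f * k;
        }
        out[i] = sum;
    }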
As you might imagine, this type of situation gets worse for loops having parameters that are very thread-dependent. The loop could even end up entirely serialized, with all but one thread being idle at any given time. If such loops are large, there could be significant performance loss; generally, smaller loops are optimized by CUDA at compile time.
The performance issues of for-loops can be exacerbated by nesting them inside if-else constructs. They can also be worsened by placing a call to __syncthreads() after a loop, which forces an entire thread block to synchronize at the end of the loop, and not just each warp separately.
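For instance, in a sketch like the one below (again with illustrative names, and assuming a block size of at most 256), every thread in the block must wait at the barrier for the slowest thread in the block, not merely the slowest thread in its own warp:

    __global__ void loop_then_sync(const int *niter, float *out)
    {
        __shared__ float scratch[256];        // assumes blockDim.x <= 256
        int i   = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + i;

        scratch[i] = 0.0f;
        for (int k = 0; k < niter[gid]; ++k)  // thread-dependent trip count
            scratch[i] += 1.0f;

        __syncthreads();                      // the whole block waits here for its
                                              // slowest thread, not just each warp

        if (i == 0) {                         // thread 0 uses every thread's result
            float total = 0.0f;
            for (int k = 0; k < blockDim.x; ++k)
                total += scratch[k];
            out[blockIdx.x] = total;
        }
    }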
A Deadlock Example
Thread divergence can also cause a program to deadlock. Consider the following example:
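A sketch of such a kernel is shown below; the kernel name and the work done in each branch are immaterial, and only the placement of the __syncthreads() calls matters:

    __global__ void deadlock_example(float *data)
    {
        int i = threadIdx.x;

        if (i % 2 == 0) {
            data[i] *= 2.0f;      // even-indexed threads
            __syncthreads();      // even threads wait here...
        } else {
            data[i] *= 3.0f;      // odd-indexed threads
            __syncthreads();      // ...while odd threads wait here
        }
    }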
Threads with an even index will execute the if block, and at the end of the block, they must wait for all the threads to reach __syncthreads(). However, the odd-numbered threads can never reach that same call. Instead, after stepping along with the even-numbered threads (though in a deactivated state), they will proceed into the else block. Ultimately, the odd-numbered threads will be waiting at the second call to __syncthreads(), at the end of the else block. Therefore, the two sets of threads end up getting stuck at different calls to __syncthreads(), unable to proceed. This is a notable issue with using __syncthreads() in an if-else construct. This problem can be resolved by placing __syncthreads() after the if-else construct.
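A sketch of the corrected pattern, using the same illustrative kernel as above, hoists the barrier out of both branches so that every thread reaches it:

    __global__ void no_deadlock_example(float *data)
    {
        int i = threadIdx.x;

        if (i % 2 == 0) {
            data[i] *= 2.0f;      // even-indexed threads
        } else {
            data[i] *= 3.0f;      // odd-indexed threads
        }
        __syncthreads();          // a single barrier that every thread reaches
    }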