Thread Synchronization
    Threads within a block usually don't complete their tasks simultaneously. To ensure all threads in a block are synchronized at a certain point in an application, a synchronization barrier, __syncthreads(), needs to be implemented. Since __syncthreads() operates only on the threads within the same block, its scope is more limited than other CUDA synchronization functions described earlier, like cudaDeviceSynchronize() and cudaStreamSynchronize().
    Most commonly, __syncthreads() is used to prevent a race condition, which can happen if threads store their computed values into shared memory, and then try to access those values. In the following example, a synchronization barrier is placed after the vector-add operation to ensure all threads complete the addition operation before continuing. If the synchronization barrier did not exist, then when the threads calculate their values of c, the last thread of the first warp might try to access an element of val that hasn't yet been computed or stored by the first thread of the second warp.
    Thread synchronization nearly always comes at a performance cost, because the earliest threads to reach a call to __syncthreads() must wait and remain idle until the very last thread reaches the same point. The same is true of the other CUDA synchronization functions; if one or a few threads take a long time to finish, then resources can be tied up at a synchronization point. Nevertheless, as the example shows, inserting a function like __syncthreads() can be absolutely necessary for the sake of correctness. To avoid the cost of the barrier, the programmer should consider if there are other ways of computing a desired result that do not require synchronization.
    There's a further reason for caution with __syncthreads(): wherever it is placed, all the threads in a given block need to reach that same point before they can execute the next instruction. If some threads are able to bypass the same call to __syncthreads() due to branching, then the other threads that have encountered it are stuck in a deadlock condition and can never progress past that point. This is a programming error! The programmer must ensure that for every block of threads, each call to __syncthreads() is executed by either all of the threads in the block, or none of them.
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)