Since there are many parallel components in a CUDA program, there are naturally different types of synchronization methods that apply to these components. Below, we will discuss three methods: host-side synchronization, device-side synchronization, and thread-level synchronization.

Definitions of Synchronization

The definition of synchronization is the act of bringing two or more processes to known points in their execution. When a process waits for another process to reach a certain point before continuing, the (former) process is blocked. This definition of synchronization is a broad one in CUDA's context, as there are multiple processes running in parallel, and we must distinguish what exactly is being synchronized.

Host-Side Synchronization

Throughout a CUDA program, it's often necessary for the host to synchronize with the device or with a specific stream. For example, the host must wait until all kernels are completed before retrieving the output data or launching new kernels. In any case, the host executes a specific line of code, after which it waits for the device to finish its specified task. We consider this to be host-side synchronization or blocking the host.

Host-side synchronization can be achieved by using the cudaDeviceSynchronize() function and, interestingly, the cudaMemcpy(...) and cudaFree(...) functions. For all these functions, the host cannot continue until the device executes all the currently queued kernels in all the streams. While these functions have other purposes, they implicitly synchronize the host with the device. Additionally, one can use cudaStreamSynchronize(stream) if the host needs to wait only for a specific stream.

Streams
Timelines of the host and streams where the host synchronizes the streams implicitly with cudaMemcpy(...). Once copying is complete, new kernels can start on the streams. The host then synchronizes the streams explicitly with cudaDeviceSynchronize(); the host and all streams must wait for the slowest stream, Stream 2, to finish.
Device-Side Synchronization

On the device side, non-default streams can synchronize with other non-default streams, and the default stream can synchronize all of the streams. Note that device-side synchronization will not dictate the behavior of the host (the converse is false).

Any kernel invoked on the default stream blocks the default stream until the active kernels in other streams have completed. Any kernel enqueued afterward will likewise be blocked until the default stream completes. Thus, launching a kernel on the default stream is a method for synchronizing all streams. In addition, cudaMalloc and cudaMemcpy share the same behavior.

Streams can be synchronized with other streams through CUDA events, which are identified by variables of type cudaEvent_t, and by the functions cudaEventCreate, cudaEventRecord, and cudaStreamWaitEvent. A CUDA event is a synchronization marker that can be recorded by streams and waited for by streams. Once an event is recorded by another stream, the streams waiting on that event may start their respective kernels. The entire process is as follows. First, a variable of type cudaEvent_t is declared and initialized with cudaEventCreate() in the host code. Then, the host calls cudaEventRecord() to instruct a stream, stream2, to record the event when it reaches certain place between the execution of its kernels. Likewise, the host calls cudaStreamWaitEvent() to instruct another stream, stream3, to wait for the event prior to the execution of its next kernel. When stream2 reaches the specified point, it records the event and continues with its next kernel. Meanwhile, Stream_3 has been waiting for the event and has been blocked from executing its kernels. When the event is recorded, stream3 is given the signal to continue with its kernels. Of course, like all CUDA variables, events need to be destroyed with cudaEventDestroy() at the end.

Streams
Timelines of the host and streams where two of the streams are being synchronized by an event. After running func2, Stream 2 records event end_2; Stream 3 must wait for the event end_2 before it can run func3.
Below is the pseudocode for stream synchronization.
Thread-Level Synchronization

Threads can be synchronized with other threads within the same block. If a thread uses a variable that is updated by another thread, then the order of these threads must be specified, or the result will be unpredictable. To prevent threads from executing lines in the wrong sequence, one can use __syncthreads() to ensure that all threads within the block reach the line where __syncthreads() appears, before continuing. An important note is that all running threads must reach the same __syncthreads(); otherwise, if the line __syncthreads() becomes unreachable to a thread that hasn't exited, it causes unexpected errors.

 
©  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)