Exercise: Streams
This short exercise demonstrates the usage of streams and asynchronous memory transfers. You can compile and run your code at any point during its development to check your progress. If you are using TACC resources, please refer to the earlier page with instructions for Slurm.
To begin, we will adapt the code from Exercise: Thread Mapping. Here is the original code:
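The original listing is not reproduced here. As a stand-in, the following is a minimal sketch of the kind of code the Thread Mapping exercise produces, assuming a simple element-wise kernel (the names `scale`, `h_a`, `d_a`, and the matrix size `N` are illustrative, not from the original):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cassert>

#define N 1024

// Each thread computes one element of the output matrix.
__global__ void scale(const float *a, float *c, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
        c[row * n + col] = 2.0f * a[row * n + col];
}

int main() {
    size_t bytes = (size_t)N * N * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < N * N; i++) h_a[i] = 1.0f;

    float *d_a, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    // 2D thread mapping: one thread per matrix element
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    scale<<<grid, block>>>(d_a, d_c, N);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N * N; i++) assert(h_c[i] == 2.0f);
    printf("PASSED\n");

    cudaFree(d_a); cudaFree(d_c);
    free(h_a); free(h_c);
    return 0;
}
```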
Showcasing streams will require more than one kernel invocation. Here we keep the one kernel we defined previously, but we invoke it on two separate matrix variables. After this modification to main(), there are two kernel invocations that call the same kernel function, but with two different inputs.
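Continuing the sketch above, the relevant change to main() might look like this, where `d_b` and `d_d` are a hypothetical second input/output pair allocated and initialized the same way as `d_a` and `d_c`:

```cuda
// Two invocations of the same kernel on two different matrices
scale<<<grid, block>>>(d_a, d_c, N);
scale<<<grid, block>>>(d_b, d_d, N);
```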
To create a stream, we must first declare a cudaStream_t variable and then initialize the stream using cudaStreamCreate. Then, during kernel invocation, the stream variable is passed as the fourth execution-configuration parameter, specifying the stream on which the kernel is enqueued. At the end of the program, the streams are destroyed with cudaStreamDestroy.
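A sketch of these three steps, using the same illustrative names as above (note that the third execution-configuration parameter, the dynamic shared-memory size, must be given explicitly, here 0, so that the stream can occupy the fourth slot):

```cuda
// Declare and create one stream per independent workload
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Fourth parameter selects the stream the kernel is enqueued on
scale<<<grid, block, 0, stream1>>>(d_a, d_c, N);
scale<<<grid, block, 0, stream2>>>(d_b, d_d, N);

// At the end of the program, release the streams
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```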
Asynchronous memory transfers are slightly more complex. First, since asynchronous memory transfer requires both pinned memory and streams, we allocate the host matrices in pinned memory. We replace cudaMemcpy with cudaMemcpyAsync and specify a stream for each transfer so that each matrix is transferred asynchronously. We also need to include cudaDeviceSynchronize(), because without it the host would attempt to validate the results before the data has been transferred back. Lastly, we free the pinned host memory with cudaFreeHost instead of free.
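Putting these pieces together, a sketch of the modified transfer code might look as follows (same illustrative names; pinned host buffers are allocated with cudaMallocHost):

```cuda
// Allocate the host matrices in pinned (page-locked) memory
float *h_a, *h_b, *h_c, *h_d;
cudaMallocHost(&h_a, bytes);
cudaMallocHost(&h_b, bytes);
cudaMallocHost(&h_c, bytes);
cudaMallocHost(&h_d, bytes);

// Asynchronous host-to-device copies, one stream per matrix
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream2);

scale<<<grid, block, 0, stream1>>>(d_a, d_c, N);
scale<<<grid, block, 0, stream2>>>(d_b, d_d, N);

// Asynchronous device-to-host copies in the same streams
cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_d, d_d, bytes, cudaMemcpyDeviceToHost, stream2);

// Block the host until both streams have finished, then validate
cudaDeviceSynchronize();

// Pinned host memory is released with cudaFreeHost, not free
cudaFreeHost(h_a); cudaFreeHost(h_b);
cudaFreeHost(h_c); cudaFreeHost(h_d);
```

Within each stream, the copy-in, kernel, and copy-out are ordered; across the two streams they may overlap, which is the point of the exercise.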
Interestingly, cudaMemcpyAsync will not cause an error if the memory is not pinned; instead, CUDA will transfer the memory as if cudaMemcpy had been called. You can observe this behavior if you revert to using malloc and remove cudaDeviceSynchronize. There will be no assertion error, because the host becomes synchronized with the device through the calls to cudaMemcpyAsync.
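A sketch of the reverted device-to-host copy that illustrates this fallback, assuming the same names as above:

```cuda
// h_c allocated with plain malloc: pageable, not pinned
float *h_c = (float *)malloc(bytes);

// With pageable memory, this copy behaves like cudaMemcpy:
// it does not return until the data has reached h_c, so the
// host can validate results without cudaDeviceSynchronize().
cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream1);
```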
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)