Shared Memory
In the device code, each kernel function manages the device memory that it requires for storing its internal variables. As we saw previously, many of these internal variables are "automatic" variables and arrays, i.e., thread-local variables that appear only in the scope of the kernel function; they are held in registers or local memory. The GPU's shared memory is a special case of device memory, and CUDA provides mechanisms for establishing how the shared memory is to be used by the threads in the same thread block. Like all the variables defined in the scope of a kernel function, shared-memory variables and arrays do not persist after the function exits.
Shared Memory
In CUDA, shared memory refers to on-chip memory that is private to a thread block and is accessible to all the threads of that block. Theoretically it is about 100x faster than global memory when fetching data to threads. As mentioned previously, the size of this private shared memory is based on the compute capability of the GPU device.
    If the size of an array in shared memory is known at compile time, we may use the CUDA specifier __shared__ to declare its size statically in the body of the kernel function, as follows:
    If the size is unknown at compile time, shared memory can be dynamically allocated using special CUDA syntax. To indicate a dynamic, shared array, prepend the keywords extern __shared__ to an unsized array (or simply a pointer) that is declared in the kernel function:
    Then, the size is specified in the host code's kernel call through an optional third parameter in the execution configuration area, <<<...>>>. This parameter gives the size of the dynamic array in bytes. Here is how to allocate n integers of shared memory to the sharememory array in the dynamicmem() kernel function, above:
    The reason for specifying the size through a configuration parameter, prior to entering the kernel function, is that the main body of the function is executed by each thread independently. It would not make sense if each thread could set its own size for a __shared__ array, because all the threads in a thread block must see the same shared memory.
    However, this seems to create a limitation for dynamic allocation, because only one __shared__ array can have its size specified dynamically. What if multiple arrays are needed? In that case, declare a single unsized array as before, and set pointers into it to divide it into multiple arrays. Here's an example code that does this for two arrays:
Then, in the kernel call, you would specify the total shared memory you need like this:
    The two int arguments to the kernel function are included as a way of letting all the threads know the dividing line between the int and float arrays in shared memory, as well as the extent of each one. Note that special considerations are needed in case an array does not have the same type size as the declared array in the device code, such as a char array for the above example. 
Together with memory management, shared memory will be highlighted in the exercise.
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)