
Exercise: Hello CUDA

In this exercise, we will build a simple CUDA program that summarizes and demonstrates the basic structure of a CUDA program.

An overview of this exercise is as follows:

Define the skeleton of the host code that will eventually call the kernel function to launch a grid on the GPU.
Define a device function with the __device__ specifier to print a hello message.
Define a kernel function with the __global__ specifier that calls the device function.
In the host code, call the kernel function.
Compile the code using the nvcc compiler and run the code on Frontera or Vista (or on another HPC system or a local machine).

1. The host code of the CUDA program will simply call the kernel function and return 0 on success. We will include two print statements, one before and one after the kernel call, to verify the program's execution. For now, assume that the program is successful as long as the kernel function is called and the hello statements are printed.
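A minimal sketch of such a host skeleton is shown here. The message strings are placeholders chosen for illustration, and the kernel call is left as a comment because the kernel has not been defined yet.

```cuda
#include <stdio.h>

int main()
{
    // Printed before the kernel launch (message text is an assumption).
    printf("Hello from the CPU (before kernel launch)\n");

    // The kernel call will go here once the kernel function is defined.

    // Printed after the kernel launch (message text is an assumption).
    printf("Back on the CPU (after kernel launch)\n");

    return 0;  // 0 signals success
}
```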

2. With the host code defined, we will now write a device function (a function that runs on the GPU and is callable only from device code) using the __device__ specifier. The device function prints a hello message followed by the thread and block indexes, which differentiate the outputs from the various threads. The input parameters are the thread and block indexes.
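One way the device function might look is sketched here. The name printHello comes from the exercise; the parameter names and the exact message format are assumptions.

```cuda
#include <stdio.h>

// Runs on the GPU. Callable only from device code (a kernel or another
// device function), never directly from the host.
// Parameter names and message text are illustrative assumptions.
__device__ void printHello(int threadIndex, int blockIndex)
{
    printf("Hello from thread %d of block %d\n", threadIndex, blockIndex);
}
```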

3. In the kernel function, denoted by the __global__ specifier and void return type, we call the device function printHello. When this kernel function is called and a grid is launched, each thread executes the printHello function. (In a practical application, a simple program like this would directly print the message in the kernel function, rather than call a separate device function to do it.)
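A possible kernel is sketched here, assuming that hello takes no arguments and passes the x components of the built-in threadIdx and blockIdx variables to printHello. The device function is repeated so the fragment compiles by itself.

```cuda
#include <stdio.h>

// Device function from the previous step (message text is an assumption).
__device__ void printHello(int threadIndex, int blockIndex)
{
    printf("Hello from thread %d of block %d\n", threadIndex, blockIndex);
}

// Kernel: launched from the host and executed once by every thread in the grid.
__global__ void hello()
{
    // threadIdx and blockIdx are defined by CUDA for each thread;
    // only the x components are used in this one-dimensional sketch.
    printHello(threadIdx.x, blockIdx.x);
}
```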

4. In the host code, call the kernel function hello using the chevron syntax <<<...>>>. The two parameters inside the chevrons determine the number of blocks in the grid and the number of threads per block, in that order. The total number of threads is the product of the two parameters. This example will launch a small number of threads and blocks.
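The launch syntax can be illustrated with the following sketch. The values 2 and 4 are arbitrary illustrative choices, and the kernel body is left empty so that the fragment focuses on the launch configuration alone.

```cuda
#include <stdio.h>

// Empty placeholder kernel so that this fragment compiles by itself.
__global__ void hello() {}

int main()
{
    const int numBlocks = 2;        // blocks in the grid (illustrative value)
    const int threadsPerBlock = 4;  // threads in each block (illustrative value)

    // First chevron parameter: number of blocks.
    // Second chevron parameter: number of threads per block.
    hello<<<numBlocks, threadsPerBlock>>>();

    // Total threads requested = blocks * threads per block.
    printf("Requested %d threads in total\n", numBlocks * threadsPerBlock);

    return 0;
}
```

With 2 blocks of 4 threads each, the product is 8 threads.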

5. Below is the full code for the exercise. Copy and paste the code into a file named hello_cuda.cu in a computing environment with access to nvcc.
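A sketch of the assembled program, built from steps 1 to 4, might look like the following. The message strings, the parameter names, and the launch configuration of 2 blocks with 2 threads each are assumptions. The build line in the header comment uses compute capability 7.5 as an example value and should be adjusted to match your GPU.

```cuda
// hello_cuda.cu
// One possible build line (compute capability 7.5 is an example value):
//   nvcc -arch=compute_75 -code=sm_75 hello_cuda.cu -o hello_cuda
#include <stdio.h>

// Device function: runs on the GPU, callable only from device code.
__device__ void printHello(int threadIndex, int blockIndex)
{
    printf("Hello from thread %d of block %d\n", threadIndex, blockIndex);
}

// Kernel: every thread in the grid calls the device function.
__global__ void hello()
{
    printHello(threadIdx.x, blockIdx.x);
}

int main()
{
    printf("Hello from the CPU (before kernel launch)\n");

    // Launch a grid of 2 blocks with 2 threads each (illustrative values).
    hello<<<2, 2>>>();

    printf("Back on the CPU (after kernel launch)\n");

    return 0;
}
```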

The full compilation and submission instructions for Frontera or Vista are provided in Submit a Job to Frontera or Vista. Below is the short version, which loads the CUDA module and compiles the code with the -arch and -code flags, whose values depend on the compute capability of your GPU device.

Once the code is successfully compiled, prepare the file for batch submission with sbatch. If you instead used idev to run an interactive job in Frontera's rtx-dev queue or Vista's gh-dev queue, you can run the program directly in the terminal. You might see the following surprisingly short message as the output:

This output is unexpected because the hello messages from the function printHello are missing. As mentioned on the CUDA Program Structure page, the host continues to execute its code after calling the kernel function. The program terminated before the device function could print its message from the device. To guarantee the message is printed, we need to instruct the host to wait for the kernel to finish by putting it to sleep with sleep. Note the addition of an #include directive to include the unistd.h header file. In future topics, proper kernel synchronization methods will be introduced. Modify the code as follows, recompile the code, and check the output.
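A sketch of the modified program, following the approach described in the text, is shown here. The one-second duration and the placement of sleep directly after the second host print statement are assumptions, as are the message strings and the launch configuration.

```cuda
#include <stdio.h>
#include <unistd.h>  // declares sleep()

__device__ void printHello(int threadIndex, int blockIndex)
{
    printf("Hello from thread %d of block %d\n", threadIndex, blockIndex);
}

__global__ void hello()
{
    printHello(threadIdx.x, blockIdx.x);
}

int main()
{
    printf("Hello from the CPU (before kernel launch)\n");

    // The launch returns immediately; the host does not wait for the kernel.
    hello<<<2, 2>>>();

    printf("Back on the CPU (after kernel launch)\n");

    // Pause the host so the kernel has time to run before the program exits.
    // The one-second duration is an arbitrary illustrative choice.
    sleep(1);

    return 0;
}
```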

The output is now:

Below are some additional questions to ponder and experiment with:

When a grid is launched, CUDA automatically defines threadIdx and blockIdx for all the threads in the grid, which can then access these variables in the device code. Modify printHello and hello so neither function takes input parameters. How can you make this work?
What happens if you try to call the device function printHello from the host? (You must first implement a solution to the first question.)
What happens if you replace the __device__ specifier in printHello with __global__? (You must implement solutions to the first and second questions before trying this.) Why does it behave this way? What happens if you comment out the call to printHello in hello?


CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)