GPU Performance Topics
Zilu Wang and Steve Lantz
Cornell Center for Advanced Computing
8/2025 (original)
GPU programming is all about performance. So far, you have learned how to manage threads and memory in your CUDA program, and you have seen that some parts of GPU memory are much faster than others. However, many additional factors must be considered to create a fast CUDA application. In this final topic of the CUDA introduction roadmap, we discuss several of these performance topics.
Objectives
After you complete this roadmap, you should be able to:
- Explain how each technique improves performance.
- Identify potential areas of improvement in CUDA code.
Prerequisites
This topic covers basic CUDA programming and its connection to GPU architecture using the C programming language. A working knowledge of C/C++ and some understanding of parallel computing are necessary for this topic. Thus, you may want to complete An Introduction to C Programming and Parallel Programming Concepts and High-Performance Computing before beginning this topic. While GPU terms are explained in the context of CUDA programming, this topic does not cover the specifics of GPU architecture; you may want to complete Understanding GPU Architecture to learn more about that. No prior experience with GPUs is assumed for the roadmap as a whole, but this final topic builds on the CUDA concepts introduced in the earlier topics of the roadmap.
Should you need an in-depth reference, NVIDIA provides complete documentation for CUDA. Visit their website to see the latest versions of the NVIDIA CUDA Runtime API reference and the CUDA C Programming Guide.
The Frontera User Guide and Vista User Guide have just a few short sections on GPUs with information on node types, job submission, and machine learning software. If you're on Frontera or Vista, be sure to load the CUDA module before compiling any programs: issue the appropriate module load command for CUDA 12.2 on Frontera or CUDA 12.5 on Vista.
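A minimal sketch of the setup steps follows. The exact module names below are assumptions based on the versions mentioned above; check what is actually installed on your system before relying on them.

```shell
# List the CUDA modules available on the cluster
module spider cuda

# Load the CUDA toolkit before compiling (names are examples,
# matching the versions noted above for each system)
module load cuda/12.2    # on Frontera
# module load cuda/12.5  # on Vista

# Confirm the CUDA compiler is now on your PATH
nvcc --version

# Compile a CUDA source file with the NVIDIA compiler
nvcc -o my_app my_app.cu
```

Both systems use an Lmod-style module environment, so `module list` will show what is currently loaded, and `module unload` reverses a load.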
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)