Threads and Cores Redefined
What is the secret to the high performance that can be achieved by a GPU? The answer lies in the graphics pipeline that the GPU is meant to "pump": the sequence of steps required to take a scene of geometrical objects described in 3D coordinates and render them on a 2D display.
Two key properties of the graphics pipeline permit it to be accelerated. First, a typical scene is composed of many independent objects (e.g., a mesh of tiny triangles approximating a surface). Second, the sequence of steps needed to render each object is basically the same for all of the objects, so the computational steps may be performed on all of them in parallel. By their very nature, then, GPUs must be highly capable parallel computing engines.
But CPUs, too, have evolved into highly capable parallel processors in their own right, and in this evolution they have acquired certain similarities to GPUs. It is therefore not surprising to find a degree of overlap in the terminology used to describe the parallelism in both kinds of processors. However, one should be careful to understand the distinctions as well, because the precise meanings of terms can differ significantly between the two types of devices.
For example, with CPUs as well as GPUs, one may speak of threads that run on different cores. In both cases, one envisions distinct streams of instructions that are scheduled to run on different execution units. Yet the ways in which threads and cores act upon data are quite different in the two cases.
It turns out that a single core in a GPU (which we'll call a CUDA core hereafter, for clarity) is much more like a single vector lane in the vector processing unit of a CPU. Why? Because CUDA cores essentially work in teams of 32 to execute a Single Instruction on Multiple Data, the type of parallelism known as SIMD. SIMD operations are possible in CPUs as well, but there they are carried out by vector units operating on smaller data groupings (typically 8 or 16 elements).
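To make the SIMT execution model concrete, here is a minimal CUDA sketch (the kernel name `scale`, the array size, and the launch configuration are all illustrative, not drawn from any particular library). Each thread computes a global index and scales one array element; within a warp, all 32 threads execute this same instruction stream in lockstep, each on its own element:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one array element. All 32 threads of a warp execute
// this same instruction stream together (SIMT), each on its own data,
// much like the lanes of a CPU vector unit, but 32 wide.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index per thread
    if (i < n)
        x[i] *= a;  // one instruction, many threads, multiple data items
}

int main()
{
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory, for brevity
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    // Launch 4 thread blocks of 256 threads each; on the device, every
    // block is executed as 8 warps of 32 threads.
    scale<<<4, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %.1f\n", x[0]);  // expect 2.0
    cudaFree(x);
    return 0;
}
```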
The table below attempts to reduce the potential sources of confusion. It lists and defines the terms that apply to the various levels of parallelism in a GPU, and gives their rough equivalents in CPU terminology. (Several new terms are introduced below; they are further explained on succeeding pages.)
| GPU term | Quick definition for a GPU | CPU equivalent |
|---|---|---|
| thread | The stream of instructions and data that is assigned to one CUDA core; note that a Single Instruction applies to Multiple Threads, acting on multiple data (SIMT) | N/A |
| CUDA core | Unit that processes one data item after another, to execute its portion of a SIMT instruction stream | vector lane |
| warp | Group of 32 threads that executes the same stream of instructions together, on different data | vector |
| kernel | Function that runs on the device; a kernel may be subdivided into thread blocks | thread(s) |
| SM, streaming multiprocessor | Unit capable of executing a thread block of a kernel; multiple SMs may work together on a kernel | core |
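To see how some of these terms map onto actual hardware, the CUDA runtime API can report the corresponding quantities for a given device. The following sketch (which assumes device 0 is a CUDA-capable GPU) queries the warp size and the number of streaming multiprocessors:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // warpSize is the number of threads that execute in lockstep
    // (32 on all CUDA GPUs to date); multiProcessorCount is the
    // number of SMs available to run thread blocks.
    printf("warp size:             %d\n", prop.warpSize);
    printf("streaming multiprocs:  %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```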