NVIDIA Turing Streaming Multiprocessor (SM) block diagram.

Let's zoom in on one of the streaming multiprocessors depicted in the diagram on the previous page. Each Turing SM gets its processing power from:

  • Sets of CUDA cores for the following datatypes
    • 64 FP32 CUDA cores
    • 64 INT32 CUDA cores
  • 8 Turing Tensor Cores
  • 16 Special Function Units
  • 4 Texture Units
  • 1 Ray Tracing core (RT core)
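To get a feel for what these per-SM counts mean for a whole device, the short sketch below scales them up by the number of SMs. The per-SM figures come from the list above. The SM count of 40 is an assumption taken from NVIDIA's published Tesla T4 specifications, and the names `TURING_SM`, `gpu_totals`, and `T4_NUM_SMS` are made up for this illustration.

```python
# Per-SM resource counts for a Turing SM, taken from the list above.
TURING_SM = {
    "fp32_cores": 64,
    "int32_cores": 64,
    "tensor_cores": 8,
    "special_function_units": 16,
    "texture_units": 4,
    "rt_cores": 1,
}


def gpu_totals(per_sm, num_sms):
    """Scale per-SM counts up to a whole GPU with num_sms active SMs."""
    return {name: count * num_sms for name, count in per_sm.items()}


# Assumption: 40 active SMs, the figure NVIDIA publishes for the Tesla T4.
T4_NUM_SMS = 40

t4 = gpu_totals(TURING_SM, T4_NUM_SMS)
for name, count in t4.items():
    print(f"{name:>24}: {count}")
```

Under that assumption, the totals work out to 2560 FP32 CUDA cores and 320 Tensor Cores. Changing `T4_NUM_SMS` gives the totals for other Turing devices.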

Just like the SMs in Volta GPUs, a Turing SM is partitioned into 4 processing blocks, as shown in the figure. This flexible layout allows the SM to schedule up to 2 warps of FP32 or INT32 operations (possibly 1 of each) for processing on every cycle. Unlike the V100, however, the Turing T4 has no meaningful FP64 capability: double-precision code still runs, but only at a small fraction of the FP32 rate.
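One way to see where a rate of 2 warps per cycle can come from is a back-of-the-envelope throughput model. The sketch below assumes that the 64 cores of one datatype are split evenly across the 4 processing blocks and that each core handles one thread's operation per cycle. It is a simplification for illustration, not a description of the real warp scheduler, and the function name `warps_per_cycle` is hypothetical.

```python
WARP_SIZE = 32          # threads per warp on NVIDIA GPUs
PROCESSING_BLOCKS = 4   # processing blocks per Turing SM, from the text


def warps_per_cycle(cores_per_sm, blocks=PROCESSING_BLOCKS, warp_size=WARP_SIZE):
    """Sustained warp-instructions per cycle for one datatype on one SM.

    Simplified model: cores are divided evenly among the blocks, and each
    core processes one thread's operation per cycle.
    """
    lanes_per_block = cores_per_sm // blocks         # 64 // 4 = 16
    cycles_per_warp = warp_size / lanes_per_block    # 32 / 16 = 2 cycles
    return blocks / cycles_per_warp                  # 4 / 2 = 2 warps per cycle


print("FP32 warps per cycle per SM: ", warps_per_cycle(64))
print("INT32 warps per cycle per SM:", warps_per_cycle(64))
```

In this model, a block with 16 lanes needs 2 cycles to work through a 32-thread warp, so the 4 blocks together sustain the equivalent of 2 warps per cycle for a given datatype.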

Data are supplied to the bulk of this hardware through 16 Load/Store units per SM (as compared to 32 for a Volta SM). These Load/Store units appear at the bottom of each processing block in the figure above.

Turing Tensor Cores offer a range of precisions for deep learning training and inference, from FP32 and FP16 to INT8 and even INT4. Turing's multi-precision computing capability advances NVIDIA's Tensor Core technology beyond Volta and enables even more efficient AI inference.
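The trade-off behind the lower-precision integer formats can be illustrated without a GPU. The sketch below applies a simple symmetric linear quantization to a handful of made-up weights and compares the worst-case rounding error at 8 bits and at 4 bits. This scheme is just one common choice, used here as an assumption; it is not how any particular framework or the Tensor Cores themselves prepare data, and the function names are hypothetical.

```python
def quantize(values, bits):
    """Symmetric linear quantization to signed integers of the given width.

    Assumes at least one value is nonzero, so that the scale is nonzero.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    # Round to the nearest integer and clamp to the representable range.
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate floating-point values."""
    return [x * scale for x in q]


def max_error(values, bits):
    """Worst-case absolute error after a quantize/dequantize round trip."""
    q, scale = quantize(values, bits)
    approx = dequantize(q, scale)
    return max(abs(a - v) for a, v in zip(approx, values))


# Made-up example weights, for illustration only.
weights = [0.02, -0.5, 0.31, 1.0, -0.77]

for bits in (8, 4):
    print(f"INT{bits}: worst-case error = {max_error(weights, bits):.4f}")
```

Fewer bits mean coarser steps and therefore larger rounding errors, which is why the narrowest formats are mainly of interest for inference workloads that can tolerate reduced precision.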

The special Ray Tracing core, though it is an interesting feature, does not typically enter into GPGPU applications. Therefore, its capabilities will not be described here.

 
© Cornell University | Center for Advanced Computing