We now zoom in on one of the streaming multiprocessors depicted in the diagram on the previous page. Each Volta SM gets its processing power from:

  • Sets of CUDA cores for the following datatypes:
    • 64 FP32 CUDA cores
    • 64 INT32 CUDA cores
    • 32 FP64 CUDA cores
  • 8 Tensor Cores
  • 16 Special Function Units
  • 4 Texture units

An SM is subdivided into 4 processing blocks, as shown in the figure, which allows for greater flexibility in scheduling. Because a warp consists of 32 threads, the CUDA cores can be scheduled so that on every clock cycle, up to 2 warps of FP32 or INT32 operations can be processed (possibly 1 of each), or just 1 warp of FP64 operations.
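The arithmetic behind these figures can be checked with a short sketch. It assumes that each kind of unit in the list above is split evenly across the 4 processing blocks and that a warp is 32 threads; the names `SM_UNITS`, `units_per_block`, and `warps_per_cycle` are illustrative only and do not belong to any NVIDIA API. Texture units are left out because the text does not say that they are divided among the processing blocks.

```python
# Per-SM unit counts for Volta, copied from the list in the text.
SM_UNITS = {"FP32": 64, "INT32": 64, "FP64": 32, "Tensor": 8, "SFU": 16}

WARP_SIZE = 32         # threads per warp on NVIDIA GPUs
PROCESSING_BLOCKS = 4  # processing blocks per Volta SM


def units_per_block(per_sm: int) -> int:
    """Units of one kind in a single processing block (assumes an even split)."""
    return per_sm // PROCESSING_BLOCKS


def warps_per_cycle(per_sm: int) -> int:
    """Full warps that fit on the cores of one datatype in a single clock cycle."""
    return per_sm // WARP_SIZE


per_block = {name: units_per_block(count) for name, count in SM_UNITS.items()}
warps = {name: warps_per_cycle(SM_UNITS[name]) for name in ("FP32", "INT32", "FP64")}

print(per_block)  # {'FP32': 16, 'INT32': 16, 'FP64': 8, 'Tensor': 2, 'SFU': 4}
print(warps)      # {'FP32': 2, 'INT32': 2, 'FP64': 1}
```

The sketch only counts cores per datatype. It does not model how the warp schedulers mix FP32 and INT32 work, so the values it prints are per-datatype upper bounds rather than a combined throughput.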

NVIDIA Volta Streaming Multiprocessor (SM) block diagram.

Data are supplied to these execution units through 32 Load/Store units, 8 in each processing block, as shown at the bottom of each block in the figure.

 
© Cornell University | Center for Advanced Computing