Inside a Volta SM
We now zoom in on one of the streaming multiprocessors depicted in the diagram on the previous page. Each Volta SM gets its processing power from:
- Sets of CUDA cores for the following datatypes:
  - 64 FP32 CUDA cores
  - 64 INT32 CUDA cores
  - 32 FP64 CUDA cores
- 8 Tensor Cores
- 16 Special Function Units
- 4 Texture units
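To get a feel for what these core counts mean, the following is a rough sketch that converts them into a theoretical peak rate. The per-SM counts come from the list above. The SM count (80) and clock speed (1.53 GHz) are assumptions chosen to resemble a Tesla V100 and should be checked against the data sheet for the actual device. The helper `peak_tflops` is a hypothetical name, and the estimate assumes every core completes one fused multiply-add (2 floating-point operations) per clock cycle.

```python
# Per-SM core counts, taken from the list above.
FP32_CORES_PER_SM = 64
FP64_CORES_PER_SM = 32

# Assumed values (not from the text): roughly those of a Tesla V100.
NUM_SMS = 80
CLOCK_GHZ = 1.53


def peak_tflops(cores_per_sm: int, num_sms: int, clock_ghz: float) -> float:
    """Theoretical peak in TFLOP/s, assuming 1 FMA (2 flops) per core per clock."""
    flops_per_clock = cores_per_sm * num_sms * 2
    # flops/clock * 1e9 clocks/s gives flop/s; dividing by 1e12 gives TFLOP/s.
    return flops_per_clock * clock_ghz / 1000.0


print(f"FP32 peak: {peak_tflops(FP32_CORES_PER_SM, NUM_SMS, CLOCK_GHZ):.1f} TFLOP/s")  # 15.7
print(f"FP64 peak: {peak_tflops(FP64_CORES_PER_SM, NUM_SMS, CLOCK_GHZ):.1f} TFLOP/s")  # 7.8
```

Because there are half as many FP64 cores as FP32 cores, the double-precision estimate is half the single-precision one. Real applications rarely reach either figure.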
An SM is subdivided into 4 processing blocks, as shown in the figure, which allows for greater flexibility in scheduling. On every clock cycle, the CUDA cores can process up to 2 warps of FP32 or INT32 instructions (possibly 1 of each), or just 1 warp of FP64 instructions.
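The warp figures quoted above follow from simple arithmetic, sketched below. It assumes the standard warp size of 32 threads and an even split of cores among the 4 processing blocks. The function names are hypothetical. The sketch only counts lanes per datatype; it does not model the warp schedulers, so it does not capture the combined FP32/INT32 limit stated in the text.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
BLOCKS_PER_SM = 4   # processing blocks per Volta SM

# Per-SM core counts from the list earlier in this section.
CORES_PER_SM = {"FP32": 64, "INT32": 64, "FP64": 32}


def warps_per_clock(cores: int) -> int:
    """Number of full warps that `cores` lanes can cover in one clock cycle."""
    return cores // WARP_SIZE


def cores_per_block(cores: int) -> int:
    """Cores of one type in each processing block, assuming an even split."""
    return cores // BLOCKS_PER_SM


for dtype, cores in CORES_PER_SM.items():
    print(f"{dtype}: {cores_per_block(cores)} cores per block, "
          f"{warps_per_clock(cores)} warp(s) per clock")
```

With these counts, FP32 and INT32 each have enough lanes for 2 warps per clock and FP64 has enough for 1, which matches the limits described above.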
![NVIDIA Volta SM block diagram, smaller size, as described in the main text](GV100SMDiagram.png)
Data are supplied to the main part of this hardware through 32 Load/Store units (8 per processing block), shown at the bottom of each processing block in the figure.