
Inside a Volta SM

We now zoom in on one of the streaming multiprocessors depicted in the diagram on the previous page. Each Volta SM gets its processing power from:

- Sets of CUDA cores for the following datatypes:
  - 64 FP32 CUDA cores
  - 64 INT32 CUDA cores
  - 32 FP64 CUDA cores
- 8 Tensor Cores
- 16 Special Function Units
- 4 Texture units
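The per-SM counts above scale linearly with the number of SMs on the chip. As a rough sketch, the Python snippet below multiplies them by 80, the number of SMs enabled on a Tesla V100 (an assumption taken from NVIDIA's published V100 specifications, not from this page). The names `VOLTA_SM`, `V100_SM_COUNT`, and `gpu_totals` are illustrative only.

```python
# Per-SM resource counts for a Volta SM, as listed above.
VOLTA_SM = {
    "FP32 CUDA cores": 64,
    "INT32 CUDA cores": 64,
    "FP64 CUDA cores": 32,
    "Tensor Cores": 8,
    "Special Function Units": 16,
    "Texture units": 4,
}

# Assumption: a Tesla V100 has 80 SMs enabled (from NVIDIA's V100 specs).
V100_SM_COUNT = 80


def gpu_totals(per_sm, sm_count):
    """Scale per-SM resource counts up to whole-GPU totals."""
    return {name: count * sm_count for name, count in per_sm.items()}


if __name__ == "__main__":
    for name, total in gpu_totals(VOLTA_SM, V100_SM_COUNT).items():
        print(f"{name:>24}: {total}")
```

With 80 SMs this yields 5,120 FP32 cores and 640 Tensor Cores, the totals NVIDIA quotes for the V100.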

An SM is subdivided into 4 processing blocks, as shown in the figure; each block holds 16 FP32 cores, 16 INT32 cores, 8 FP64 cores, and 2 Tensor Cores. This partitioning allows for greater flexibility in scheduling. The CUDA cores can be scheduled so that on every clock cycle, up to 2 warps of FP32 or INT32 can be processed (possibly 1 of each), or just 1 warp of FP64.
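The warp counts in the scheduling description follow from simple arithmetic: a warp is 32 threads, so the peak number of warps an SM can push through a pipeline per clock is the number of cores of that type divided by 32. The sketch below also splits the cores evenly over the 4 processing blocks to estimate how many clocks one warp instruction occupies a single block's pipeline. It is an idealized model that assumes one operation per core per clock and ignores latency, instruction dependencies, and issue limits; the function names are hypothetical.

```python
WARP_SIZE = 32          # threads per warp
PROCESSING_BLOCKS = 4   # processing blocks per Volta SM

# Cores of each type per SM, from the list above.
CORES_PER_SM = {"FP32": 64, "INT32": 64, "FP64": 32}


def warps_per_clock(cores_per_sm):
    """Peak warps' worth of work per clock across the whole SM (idealized)."""
    return cores_per_sm / WARP_SIZE


def cycles_per_warp_in_block(cores_per_sm):
    """Clocks one warp instruction occupies a single block's pipeline."""
    lanes_per_block = cores_per_sm // PROCESSING_BLOCKS
    return WARP_SIZE // lanes_per_block


if __name__ == "__main__":
    for kind, cores in CORES_PER_SM.items():
        print(f"{kind}: {warps_per_clock(cores):g} warp(s)/clock per SM, "
              f"{cycles_per_warp_in_block(cores)} clocks per warp in one block")
```

For FP32 and INT32 the model gives 2 warps per clock per SM (2 clocks per warp within a block); for FP64 it gives 1 warp per clock per SM (4 clocks per warp within a block), matching the figures in the text.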

Figure: NVIDIA Volta Streaming Multiprocessor (SM) block diagram.

Data are supplied to the cores through 32 Load/Store units, 8 of which appear at the bottom of each processing block in the figure.
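A similar back-of-the-envelope estimate applies to memory instructions. Under the simplifying assumption (not stated on this page) that each Load/Store unit handles one thread's address per clock and that the 32 units are split evenly across the 4 processing blocks, a block needs several clocks to cover all 32 threads of a warp. Real memory throughput also depends on coalescing, caching, and the rest of the memory hierarchy, all of which this sketch ignores; `ldst_clocks_per_warp` is a hypothetical name.

```python
import math

WARP_SIZE = 32
PROCESSING_BLOCKS = 4
LDST_UNITS_PER_SM = 32


def ldst_clocks_per_warp(ldst_units_per_sm=LDST_UNITS_PER_SM,
                         blocks=PROCESSING_BLOCKS,
                         active_threads=WARP_SIZE):
    """Clocks for one block's LD/ST units to cover a warp.

    Idealized model (assumption): one thread address per LD/ST unit per clock.
    """
    units_per_block = ldst_units_per_sm // blocks
    return math.ceil(active_threads / units_per_block)


if __name__ == "__main__":
    print(ldst_clocks_per_warp())  # 8 units per block -> 4 clocks
```

In this model a warp with only some threads active (for example, after a divergent branch) needs proportionally fewer passes through the block's 8 units.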
