Inside a Blackwell B200 SM
We now zoom in on one of the streaming multiprocessors depicted in the diagram on the previous page. Each Blackwell B200 SM gets its processing power from:
- Sets of CUDA cores for the following datatypes:
  - 128 FP32 CUDA cores
  - 128 INT32 CUDA cores
  - 64 FP64 CUDA cores
- 4 Tensor Cores (5th generation)
- 4 Special Function Units
- 4 Texture units
In general, an SM is subdivided into 4 processing blocks, as shown in the figure, which gives the hardware greater flexibility in scheduling. On every clock cycle, the CUDA cores can process either 4 separate warps of FP32 or INT32 instructions (one per processing block), or 2 warps of FP64, or possibly some combination of these.
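While per-SM core counts like those above are fixed by the architecture and not reported by the runtime, the related per-SM resource limits can be queried on any CUDA device. The following minimal sketch uses the standard `cudaGetDeviceProperties` runtime call; the printed values will of course vary by GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: query the device-0 properties that correspond to the
// SM resources discussed above. Core counts per SM are not exposed
// directly by the runtime API; they are fixed by the architecture.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device:             %s\n", prop.name);
    printf("SM count:           %d\n", prop.multiProcessorCount);
    printf("Warp size:          %d\n", prop.warpSize);
    printf("Registers per SM:   %d (32-bit)\n", prop.regsPerMultiprocessor);
    printf("Shared mem per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```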
Source: Chips and Cheese
Data are supplied to these execution units through 32 Load/Store units per SM, shown at the bottom of each processing block:
- Loads go into the ample 64 KB register file or the 64 KB tensor memory (TMEM) associated with each processing block. A Blackwell B200 SM therefore has, in total, 256 KB of register memory and 256 KB of TMEM.
- Stores, on the other hand, are generally pushed out to the 256 KB L1 data cache. From there, they can propagate to the globally shared L2 cache, and then into global memory. Note that the L1 cache size can be adjusted downward to make room for special "shared memory" within the SM, where each data item is accessible to any thread in the same thread block.
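The L1/shared-memory trade-off just described can be seen from the programmer's side: declaring `__shared__` storage in a kernel carves space for it out of the same physical array that backs the L1 data cache, and the runtime accepts an advisory carveout hint. A minimal sketch (the kernel name and 50% hint are illustrative choices, not fixed by the hardware):

```cuda
#include <cuda_runtime.h>

// Sketch of the L1/shared-memory trade-off: the __shared__ tile below
// occupies part of the SM's combined L1/shared-memory array, and every
// thread in the block can read what any other thread wrote to it.
__global__ void reverse32(float *d) {
    __shared__ float tile[32];        // lives in the SM's shared memory
    int t = threadIdx.x;
    tile[t] = d[t];
    __syncthreads();                  // make all 32 loads visible block-wide
    d[t] = tile[31 - t];              // read another thread's element
}

int main() {
    // Advisory hint: prefer up to 50% of the L1/shared array for shared
    // memory when launching this kernel. The driver may choose otherwise.
    cudaFuncSetAttribute(reverse32,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);
    float *d;
    cudaMalloc(&d, 32 * sizeof(float));
    reverse32<<<1, 32>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```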
Data are loaded into the tensor cores through the special "tensor memory accelerator" (TMA), which can do bulk asynchronous loads independently, thereby taking pressure off the CUDA cores.
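Direct TMA programming relies on low-level, architecture-specific interfaces, but the overlap idea it enables can be sketched with the portable cooperative-groups API, where a bulk global-to-shared copy proceeds asynchronously while the CUDA cores remain free for independent work. The kernel below is an illustrative sketch, not the TMA interface itself:

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch: an asynchronous bulk copy from global to shared memory, staged
// without the CUDA cores issuing per-element loads. On recent
// architectures the toolkit can lower such copies to hardware-accelerated
// paths; programming the TMA directly requires lower-level interfaces.
__global__ void scale(const float *in, float *out, float s) {
    __shared__ float stage[256];
    cg::thread_block block = cg::this_thread_block();

    // Kick off a bulk copy of 256 floats; threads could do independent
    // work here while the copy is in flight.
    cg::memcpy_async(block, stage,
                     in + blockIdx.x * 256, sizeof(float) * 256);

    cg::wait(block);                  // block until the staged data arrive
    int t = threadIdx.x;
    out[blockIdx.x * 256 + t] = s * stage[t];
}

int main() {
    float *in, *out;
    cudaMalloc(&in, 256 * sizeof(float));
    cudaMalloc(&out, 256 * sizeof(float));
    scale<<<1, 256>>>(in, out, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```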
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)