The Blackwell B200 features high-bandwidth HBM3e memory, which is stacked on the same physical package as the GPU, permitting more GPUs and memory to be installed in servers. The B200s targeted for TACC Horizon have four HBM3e stacks attached to each of the chip's two dies, for eight stacks and 192 GB of GPU memory in each chip. This configuration provides 8 TB/s of peak memory bandwidth across the eight stacks.

Data integrity is important in scientific computing, and the CPUs of HPC systems typically have ECC (error correction code) memory as a guard against data corruption. This feature is found in NVIDIA's GPGPU devices as well, and it is a property of both the Tesla V100 and the Blackwell B200.

Most often, data starts out on the host and must be transferred to the memory on the device. Such memory transfers may be initiated explicitly, but it is also possible to do this through Unified Memory, a single memory address space that permits applications to allocate and access a common set of data from code running on either CPUs or GPUs. This can be especially advantageous in the Grace Blackwell Superchip, due to the NVLink-C2C connecting the Grace CPU to each of the two Blackwell GPUs.

Allocating Unified Memory is straightforward: typical calls to malloc() or new are simply replaced with calls to cudaMallocManaged(). The resulting pointer provides access to the data from any processor. CUDA system software works with the hardware to take care of migrating the requested memory pages to the physical memory of the accessing processor, whether it is on the host or the device.
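As a minimal sketch of this pattern (the array size and the doubling kernel are illustrative, not from the text), a Unified Memory allocation and its use from both sides might look like:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: double each element in place
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;

    // One allocation, accessible from host and device alike;
    // this replaces malloc()/new plus explicit copies
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; i++) x[i] = 1.0f;  // touch the data on the host

    scale<<<(n + 255) / 256, 256>>>(x, n);    // touch the data on the device
    cudaDeviceSynchronize();                  // wait before reading on the host again

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

No cudaMemcpy() calls appear anywhere: the CUDA system software migrates the pages to whichever processor touches them.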

Several streams of memory transfers are typically initiated on the CPU side, so that some GPU kernels can start doing their computations while other kernels wait for data. This is true of explicit transfers as well as those done implicitly through Unified Memory.
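A hedged sketch of this overlap for explicit transfers follows; `process()`, `d_buf`, `h_buf`, `bytes`, and the launch configuration are assumed names for illustration, and the host buffers would need to be pinned (allocated with `cudaMallocHost()`) for the copies to be truly asynchronous:

```cuda
// Two streams let the copy for chunk 1 proceed while the
// kernel for chunk 0 is already computing on the device.
cudaStream_t s[2];
for (int i = 0; i < 2; i++)
    cudaStreamCreate(&s[i]);

for (int i = 0; i < 2; i++) {
    // Asynchronous host-to-device copy, queued in this chunk's stream
    cudaMemcpyAsync(d_buf[i], h_buf[i], bytes,
                    cudaMemcpyHostToDevice, s[i]);
    // Kernel in the same stream starts as soon as its copy completes
    process<<<blocks, threads, 0, s[i]>>>(d_buf[i]);
}
cudaDeviceSynchronize();  // wait for all streams to finish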

Host-to-Device and Device-to-Device Transfers

The speed at which data can be transferred from the host to the device and back is set by the type of interconnect. Most commonly, GPU devices are inserted into expansion slots in their hosts, and the number of PCIe lanes in the slot determines the peak bandwidth. For example, if a graphics card is connected to the host by 16 lanes (x16) of PCIe 5.0, where each lane is capable of 4 GB/s in each direction, the link carries 64 GB/s in each direction, for an overall bidirectional bandwidth of 128 GB/s between host and device.

TACC Horizon, however, is built around the Grace Blackwell Superchip, which incorporates NVIDIA's NVLink-C2C technology. This means the Grace CPU has coherent connections to each of the two B200 GPUs at 900 GB/s chip-to-chip, much faster than any PCIe connection. Grace Blackwell also includes NVLink 5.0 to provide 1.8 TB/s connections between its two GPUs as well as GPUs on other nodes, through an NVSwitch.

NVLink was developed originally to give dramatic improvements in the bandwidth between GPU devices. But with NVLink-C2C, NVIDIA has extended this capability to its internal host-to-device connections. This important enhancement removes the potential PCIe bottleneck between CPUs and GPUs. In fact, the NVLink 5.0 connections are so fast that they rival the memory bandwidth available to the different processing units in Grace Blackwell: as mentioned, each Blackwell B200 has a peak bandwidth of 8 TB/s to its HBM3e memory, while the Grace CPU is capable of 1 TB/s to its LPDDR5X memory.

The L2 cache in the Blackwell B200 also plays a key role in moving data from the host to the device's global memory, as well as from global memory to the SMs and CUDA cores. The full memory hierarchy and the properties of the various caches within it were detailed in the GPU Memory topic.

In conclusion, the high flop/s rate of a GPU is only meaningful if the data rate from and to the host can keep up. NVLink-C2C, NVLink 5.0, and HBM3e may therefore be crucial in allowing data-heavy applications to use the full speed of the Blackwell B200. In any case, the B200 is a viable option for highly parallelizable applications that need FP64 precision, along with AI/ML applications that can make good use of its abundance of fast memory and fifth-generation tensor cores.

 
©  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)