RTX 5000 Memory & PCIe 3.0
Unlike the Tesla V100, the Quadro RTX 5000 does not come with special 3D-stacked memory. Instead, it has 16 GB of the latest generation of graphics memory, GDDR6 SDRAM. Thanks to its 8 memory controllers, the RTX 5000 achieves a healthy peak memory bandwidth of 448 GB/s.
While the above memory size and bandwidth amount to only half of the 32 GB and 900 GB/s rating of the HBM2 memory in the V100, it should be noted that the RTX 5000 also possesses only 60% as many SMs as its beefier cousin. Similarly, its L2 cache is 4 MB in size, which is proportionally smaller than the 6 MB L2 in the V100. Therefore, in terms of quantity and speed of memory per SM, the two devices are roughly on a par with each other.
The exception to this is the L1 cache and shared memory in each SM of the RTX 5000, which is only 75% as big as the same unified data cache in the V100 (96 KB vs. 128 KB). The table below gives a full side-by-side summary of the memory hierarchies and compute power in the two devices.
| | Quadro RTX 5000 | Tesla V100 |
|---|---|---|
| Main Memory Type | GDDR6 SDRAM | HBM2 |
| Error Correction Code (ECC) Memory | yes | yes |
| Main Memory Size | 16 GB | 32 GB |
| Main Memory Bandwidth | 448 GB/s | 900 GB/s |
| L2 Cache Size | 4 MB | 6 MB |
| Shared Memory + L1 Cache Size (per SM) | 96 KB | 128 KB |
| Streaming Multiprocessors (SMs) | 48 | 80 |
| FP32/INT32/FP64 CUDA Cores (per SM) | 64/64/0 | 64/64/32 |
| Tensor Cores (per SM) | 8 | 8 |
| GPU Boost Clock | 1.815 GHz | 1.53 GHz |
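Most of the figures in this table can be checked at run time through the CUDA runtime API. The sketch below is illustrative only: it loops over the visible GPUs and prints the properties that correspond to the rows above. Note that `cudaDeviceProp::sharedMemPerMultiprocessor` reports the maximum shared memory per SM, which is generally smaller than the combined shared-memory + L1 capacity quoted in the table, and `clockRate` may not match the advertised boost clock exactly.

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Global memory:     %.1f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  L2 cache:          %.1f MB\n", prop.l2CacheSize / 1e6);
        printf("  Max shared mem/SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
        printf("  SMs:               %d\n", prop.multiProcessorCount);
        printf("  Peak clock:        %.3f GHz\n", prop.clockRate / 1e6);  // clockRate is in kHz
    }
    return 0;
}
```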
What about moving data back and forth to the host? Each of the 4 Quadros in a Frontera node is linked to the host via a separate x16 PCI Express (PCIe) 3.0 bus, which has a bidirectional bandwidth of up to 32 GB/s. While this set of connections seems more than adequate, PCIe nevertheless has the potential to become a major bottleneck for tasks involving lots of memory transactions between the host and the devices.
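To see how close a real workload comes to that peak, host-to-device bandwidth can be estimated with a simple timed copy out of pinned (page-locked) host memory. The sketch below is only illustrative; the 256 MB buffer and 20 repetitions are arbitrary choices, and the measured one-way rate on a x16 PCIe 3.0 link should come in somewhat below 16 GB/s, i.e., half of the 32 GB/s bidirectional figure.

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256UL << 20;              // 256 MB per transfer (arbitrary)
    const int reps = 20;

    float *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost((void**)&h_buf, bytes);         // pinned host memory, needed for peak PCIe rates
    cudaMalloc((void**)&d_buf, bytes);

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // warm-up copy

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One-way rate: a x16 PCIe 3.0 link peaks near 16 GB/s in each direction.
    printf("Host-to-device: %.1f GB/s\n", reps * bytes / (ms * 1e6));

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```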
The limitation becomes especially apparent if one compares PCIe3 x16 to the NVLink 2.0 connections described earlier for the Tesla V100. A single V100 actually has 6 links available to use, and each of these is capable of 50 GB/s (bidirectional). If all 6 are connected to the host, then the total NVLink 2.0 bidirectional bandwidth climbs to 300 GB/s, almost an order of magnitude greater than the 32 GB/s of PCIe3. However, as discussed previously, the NVLink bandwidth of each V100 in the IBM AC922 server is split between the host and a neighboring V100, so that each gets 150 GB/s.
It turns out that the Quadro RTX 5000 possesses NVLink 2.0 capability, too, but it has only a single 50 GB/s link to work with. In Frontera, this spare link is used to couple the Quadros together within each server, so that the 4 Quadros are linked in pairs in a manner similar to the 4 V100s in the IBM AC922, but with lower total bandwidth in each connection.
| | Quadro RTX 5000 | Tesla V100 (IBM host) |
|---|---|---|
| Host-to-Device Connection Type | 1 PCIe Gen 3 (x16) | 3 NVLink 2.0 (x8) |
| Host-to-Device Bidirectional Bandwidth | 32 GB/s | 150 GB/s |
| Peer-to-Peer Connection Type (per pair) | 1 NVLink 2.0 (x8) | 3 NVLink 2.0 (x8) |
| Peer-to-Peer Bidirectional Bandwidth | 50 GB/s | 150 GB/s |
Tests on Frontera show that the peer-to-peer link between a pair of RTX 5000 GPUs actually tends to run slower than the PCIe3 link to the host, even though the peer-to-peer link rate is nominally faster, as seen in the above table. The NVLink 2.0 latency is better than the PCIe3 latency, however. You can test these claims for yourself later, in the bonus exercise.
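If you want a head start on that exercise, the sketch below shows the basic mechanics of a peer-to-peer copy. It assumes, purely for illustration, that devices 0 and 1 are one of the NVLink-connected pairs in the node (the actual pairing on a given node may differ), and the 256 MB buffer size is arbitrary.

```
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Check that devices 0 and 1 can address each other's memory directly.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("Devices 0 and 1 are not peer-accessible on this node.\n");
        return 1;
    }

    // Enable peer access in both directions (each call applies to the current device).
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    const size_t bytes = 256UL << 20;    // 256 MB (arbitrary)
    float *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaMalloc((void**)&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc((void**)&buf1, bytes);

    // Make sure both devices are idle, then time one device-1 to device-0 copy.
    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);
    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("Peer-to-peer copy: %.1f GB/s\n", bytes / sec / 1e9);

    cudaFree(buf1);                      // current device is still 1
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```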
There are at least two ways to work around the bandwidth limitations of the Quadro RTX 5000: maximize the on-device memory usage, and overlap data transfers with other tasks. In particular, if you know that a task requires more than 5 GB of memory, you can minimize the memory transfer overhead by streaming data to the device while other tasks are running (hence the term "streaming" in "streaming multiprocessor").
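A common way to implement the second strategy is to split a large array into chunks and issue each chunk's host-to-device copy and kernel launch into alternating CUDA streams, so that the copy engine and the SMs stay busy at the same time. The sketch below is a minimal illustration of that pattern; the `scale` kernel, the chunk count, and the array size are invented for this example, and the asynchronous copies require pinned host memory.

```
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in for "other tasks": scale each element in place.
__global__ void scale(float *x, size_t n, float a) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 24;                 // total elements (arbitrary)
    const int nChunks = 8;
    const size_t chunk = n / nChunks;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_x = nullptr, *d_x = nullptr;
    cudaMallocHost((void**)&h_x, n * sizeof(float));   // pinned memory, required for async copies
    cudaMalloc((void**)&d_x, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) h_x[i] = 1.0f;

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    // Alternate chunks between two streams so that the copy for one chunk
    // overlaps with the kernel working on the previous chunk.
    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t s = streams[c % 2];
        size_t offset = c * chunk;
        cudaMemcpyAsync(d_x + offset, h_x + offset, chunkBytes,
                        cudaMemcpyHostToDevice, s);
        scale<<<(chunk + 255) / 256, 256, 0, s>>>(d_x + offset, chunk, 2.0f);
    }
    cudaDeviceSynchronize();

    // Copy back and spot-check one value.
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_x[0] = %.1f (expected 2.0)\n", h_x[0]);

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFreeHost(h_x);
    cudaFree(d_x);
    return 0;
}
```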