Frontera's GPU Subsystems
At inception, the leadership-class Frontera system at the Texas Advanced Computing Center included two GPU subsystems. The first, shown in the top figure and named "Longhorn", was geared toward double-precision work. Prior to its decommissioning in 2022, it comprised over 400 NVIDIA Tesla V100 GPU accelerators hosted in more than 100 IBM POWER9-based AC922 servers, with 4 GPUs per server. A full account of the properties of the Tesla V100 appears in a prior topic of the Understanding GPU Architecture roadmap. The remaining subsystem, which can be accessed via special queues on Frontera, consists of 360 NVIDIA Quadro RTX 5000 graphics cards hosted in Dell servers based on Intel Broadwell processors, again with 4 GPUs per server. Together, Frontera's original pair of GPU subsystems contributed 11 petaflop/s of single-precision computing power, serving to accelerate research in artificial intelligence, machine learning, and molecular dynamics.
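As a quick sanity check on the 11 petaflop/s figure, the short host-only sketch below multiplies each GPU count by its vendor-rated peak single-precision throughput. The inputs are assumptions rather than figures from this page: a total of 448 V100s (112 servers times 4 GPUs) at roughly 15.7 teraflop/s each, and the 360 Quadro RTX 5000s at roughly 11.2 teraflop/s each.

    /* Back-of-the-envelope check of the combined single-precision peak.
       Assumed (not stated above): 448 V100s in Longhorn (112 x 4), with
       vendor peak FP32 ratings of ~15.7 Tflop/s per V100 (SXM2) and
       ~11.2 Tflop/s per Quadro RTX 5000. Host-only code; compiles with
       nvcc or any C++ compiler. */
    #include <cstdio>

    int main()
    {
        const int    nV100  = 448;    // assumed: 112 AC922 servers x 4 GPUs
        const int    nRTX   = 360;    // 90 servers x 4 GPUs
        const double tfV100 = 15.7;   // peak FP32, Tflop/s (assumed rating)
        const double tfRTX  = 11.2;   // peak FP32, Tflop/s (assumed rating)

        double pflops = (nV100 * tfV100 + nRTX * tfRTX) / 1000.0;
        std::printf("Combined peak FP32: %.1f Pflop/s\n", pflops);  // ~11.1
        return 0;
    }

Under these assumptions the total comes to about 11.1 petaflop/s, consistent with the combined figure quoted above.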
Interestingly, because of the very high concentration of heat-generating computing power, Frontera's design includes special features to keep its components cool enough to run at top speed. Nearly all of its racks and servers are water cooled; standard air cooling with fans would be insufficient. The NVIDIA Quadros in particular are cooled in a very unusual way: as shown in the second figure, they are completely submerged in baths of liquid coolant, a solution developed by GRC. (The V100s in the former Longhorn subsystem happened to be air cooled. However, had Longhorn possessed 6 V100s per node instead of 4, the water-cooled variant of the IBM AC922 servers would have been required.)
In the pages to come, we'll take a deep dive into the RTX 5000 to see what makes it attractive for GPGPU.
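As a preview, the headline characteristics of any CUDA-capable card can be read directly with the CUDA runtime API's cudaGetDeviceProperties call. The minimal sketch below is illustrative rather than part of the roadmap; on a Quadro RTX 5000 one would expect it to report compute capability 7.5 (the Turing generation) and about 16 GB of global memory.

    // Minimal device query using the CUDA runtime API.
    // Compile on a GPU node with: nvcc -o devquery devquery.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaError_t err = cudaGetDeviceProperties(&prop, 0);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "Error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        std::printf("Device name:         %s\n", prop.name);
        std::printf("Compute capability:  %d.%d\n", prop.major, prop.minor);
        std::printf("Multiprocessors:     %d\n", prop.multiProcessorCount);
        std::printf("Global memory:       %.1f GB\n",
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }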