Data Formats
Data formats designed for HPC, like netCDF and HDF5, have several advantages over plain data files. First, such files are portable to any system on which the libraries are installed, making them suitable formats for archiving. Furthermore, the I/O library can optimize reads and writes for a particular file system based on a file's built-in metadata, which describes the structure of the data inside the file. Likewise, the libraries provide routines that let one interrogate a file and determine its properties without knowing them in advance.
This roadmap describes libraries that support parallelized access to netCDF or HDF5 data files. We focus initially on the specialized parallel interface called PnetCDF, which works with classic CDF file formats (CDF-1, CDF-2, and CDF-5), but does not extend to the enhanced, HDF5-based types of netCDF files. For the HDF5 format, we will later introduce PHDF5.
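As a rough illustration of how the classic and HDF5-based families differ on disk, the sketch below classifies a file by its leading bytes. It relies on two facts about the formats: classic files begin with the characters `CDF` followed by a version byte (1, 2, or 5), and HDF5 files begin with an 8-byte signature. The function and helper names are hypothetical, and the check assumes the HDF5 signature sits at offset 0, so it will miss HDF5 files that carry a user block. It also cannot tell a netCDF-4 file from any other HDF5 file.

```python
import os
import tempfile

# First 8 bytes of an HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
# Classic files start with b"CDF" followed by one version byte.
CLASSIC_VERSIONS = {1: "CDF-1", 2: "CDF-2", 5: "CDF-5"}


def detect_format(path):
    """Classify a file as CDF-1, CDF-2, CDF-5, HDF5, or unknown from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) >= 4 and head[:3] == b"CDF" and head[3] in CLASSIC_VERSIONS:
        return CLASSIC_VERSIONS[head[3]]
    if head == HDF5_SIGNATURE:
        return "HDF5"
    return "unknown"


def write_stub(directory, name, leading_bytes):
    """Hypothetical helper: write a stub file holding only the given leading bytes."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(leading_bytes)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stubs = [("a.nc", b"CDF\x01"), ("b.nc", b"CDF\x05"), ("c.h5", HDF5_SIGNATURE)]
        for name, magic in stubs:
            print(name, detect_format(write_stub(tmp, name, magic)))
```

In practice one would normally ask the I/O library itself to report the format; the byte check only shows why a tool restricted to classic CDF cannot open an HDF5-based file.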
Network Common Data Form
NetCDF (or the Network Common Data Form) is a collection of software libraries developed by Unidata to manage all types of scientific data, including array-based data structures, in machine-independent data formats. Initial versions of the libraries were built to handle files stored in CDF, the Common Data Format, which remains the default file type when creating new files with netCDF. The figure below shows the basic building blocks of "classic CDF" files, comprising the CDF-1, CDF-2, and CDF-5 file types.
[Figure: The elements of the classic CDF data model, as described in the main text. A file has named variables, dimensions, and attributes. Variables also have attributes. Variables may share dimensions, indicating a common grid. One dimension may be of unlimited length.]
The diagram illustrates two key characteristics of data in netCDF: it is structured and self-describing. Each variable in a file is accompanied by dimensions and attributes, and is stored as one of the six available primitive data types (char, byte, short, int, float, or double). Because this metadata is included, the file itself holds the information that netCDF needs to read and process it.
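The classic data model can be sketched with a few plain Python classes. This is a toy model for illustration only, not the netCDF API: the class and method names are hypothetical, and nothing is written to disk. It enforces the rules stated above: each variable has one of the six primitive types, refers to dimensions defined in the file, and at most one dimension is unlimited.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# The six primitive types named in the text.
CLASSIC_TYPES = ("char", "byte", "short", "int", "float", "double")


@dataclass
class Dimension:
    name: str
    size: Optional[int]  # None marks the unlimited dimension


@dataclass
class Variable:
    name: str
    dtype: str
    dimensions: Tuple[str, ...]
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class ClassicFile:
    dimensions: Dict[str, Dimension] = field(default_factory=dict)
    variables: Dict[str, Variable] = field(default_factory=dict)
    attributes: Dict[str, object] = field(default_factory=dict)

    def add_dimension(self, name, size=None):
        # Only one dimension in the file may be unlimited.
        if size is None and any(d.size is None for d in self.dimensions.values()):
            raise ValueError("only one dimension may be unlimited")
        self.dimensions[name] = Dimension(name, size)

    def add_variable(self, name, dtype, dimensions, **attributes):
        if dtype not in CLASSIC_TYPES:
            raise ValueError(f"unsupported type: {dtype}")
        missing = [d for d in dimensions if d not in self.dimensions]
        if missing:
            raise KeyError(f"undefined dimensions: {missing}")
        self.variables[name] = Variable(name, dtype, tuple(dimensions), attributes)


if __name__ == "__main__":
    f = ClassicFile(attributes={"title": "toy example"})
    f.add_dimension("time")      # unlimited
    f.add_dimension("lat", 4)
    f.add_dimension("lon", 8)
    # Two variables sharing the same dimensions, that is, a common grid.
    f.add_variable("temperature", "float", ("time", "lat", "lon"), units="K")
    f.add_variable("pressure", "float", ("time", "lat", "lon"), units="Pa")
    for v in f.variables.values():
        print(v.name, v.dtype, v.dimensions, v.attributes)
```

Everything a reader needs in order to interpret `temperature` (its type, shape, and units) travels with the object, which is what "self-describing" means here.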
HDF5
HDF5 (the Hierarchical Data Format, version 5) is a library, data model, and file format designed for managing data, especially in HPC where efficient I/O is a high priority. HDF5 has two types of objects: groups and datasets. A group defines hierarchical relationships among objects. A dataset object is a (multidimensional) array. Object names are accessible using UNIX-style paths, which reflect the hierarchical structure of the file. Annotations or metadata can be attached to objects as HDF5 attributes. The HDF5 model is versatile, capable of representing complex, heterogeneous data objects in a self-contained, portable format.
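The relationship between groups, datasets, attributes, and UNIX-style paths can be sketched in pure Python as follows. This is a toy model under simplifying assumptions, not the HDF5 library or any of its bindings: the names `Group`, `Dataset`, `add_group`, `add_dataset`, `lookup`, and `walk` are hypothetical, datasets hold only a shape and a type name, and the hierarchy is a strict tree.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union


@dataclass
class Dataset:
    shape: Tuple[int, ...]   # extents of the multidimensional array
    dtype: str
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    members: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)

    def add_group(self, name):
        group = Group()
        self.members[name] = group
        return group

    def add_dataset(self, name, shape, dtype):
        dataset = Dataset(tuple(shape), dtype)
        self.members[name] = dataset
        return dataset

    def lookup(self, path):
        """Resolve a UNIX-style path such as '/simulation/temperature'."""
        node = self
        for part in path.strip("/").split("/"):
            if not part:
                continue  # handles "/" and repeated slashes
            if not isinstance(node, Group):
                raise KeyError(path)
            node = node.members[part]
        return node

    def walk(self, prefix=""):
        """Yield (path, object) pairs for everything below this group."""
        for name, obj in self.members.items():
            path = f"{prefix}/{name}"
            yield path, obj
            if isinstance(obj, Group):
                yield from obj.walk(path)


if __name__ == "__main__":
    root = Group()
    sim = root.add_group("simulation")
    temp = sim.add_dataset("temperature", (10, 20), "float64")
    temp.attrs["units"] = "K"   # metadata attached as an attribute
    for path, obj in root.walk():
        print(path, type(obj).__name__)
    print(root.lookup("/simulation/temperature").attrs)
```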
NetCDF-4
The latest version of the libraries, netCDF-4, differs from its predecessors in that it also supports an "enhanced data model" in which files are stored not in CDF, but in HDF5. The HDF5-based format allows different netCDF-4 variables to be arranged into a hierarchy of groups. (HDF5, the parent format, is more general in that it does not require a strict hierarchy.) The figure below shows the enhanced data model used within netCDF-4.
[Figure: The netCDF data model, as described in the main text. A file has a top-level unnamed group. Each group may contain one or more named subgroups, user-defined types, variables, dimensions, and attributes. Variables also have attributes. Variables may share dimensions, indicating a common grid.]
As shown in the diagram, the top-level data group may contain one or more named subgroups, all of which can possess the properties already described for CDF. Moreover, with netCDF-4, data types may be chosen from among 12 predefined primitive types or defined by the user. The primitive types include char, string, float, and double, and signed and unsigned variants of byte, short, int, and int64.
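To make the count of 12 concrete, the sketch below tabulates the primitive types and separates the six listed for classic files from the six added here. The type names follow netCDF's text notation (CDL), in which the unsigned variants are spelled `ubyte`, `ushort`, `uint`, and `uint64`. The dictionary and function names are hypothetical, and the table is an illustration rather than part of any netCDF interface.

```python
# Size in bytes of each primitive type; None means variable length.
CLASSIC_TYPES = {"char": 1, "byte": 1, "short": 2, "int": 4, "float": 4, "double": 8}
ADDED_TYPES = {"ubyte": 1, "ushort": 2, "uint": 4, "int64": 8, "uint64": 8, "string": None}
NETCDF4_TYPES = {**CLASSIC_TYPES, **ADDED_TYPES}


def beyond_classic_six(type_names):
    """Return, sorted, the requested types that are not among the six classic ones."""
    unknown = [t for t in type_names if t not in NETCDF4_TYPES]
    if unknown:
        raise ValueError(f"not a netCDF-4 primitive type: {unknown}")
    return sorted(t for t in type_names if t in ADDED_TYPES)


if __name__ == "__main__":
    print(len(NETCDF4_TYPES), "primitive types")
    print(beyond_classic_six(["float", "uint64", "string", "int"]))
```

A check like this could serve as a quick guard before choosing an output format, since a variable that uses one of the added types cannot be described with the six classic types alone.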