
Forms of Data

Data come in many different forms, formats, sizes, and shapes, and arrive through many different conduits. Understanding the structure of the data you will be working with will help you choose the tools needed for that work.

Regular Data

Scientific data are often regular in structure, and can be efficiently represented with multidimensional arrays. Common examples include matrices used in linear algebra, the 3-dimensional coordinates of every particle in a system (represented, e.g., as an N x 3 array of floating point numbers), a set of data relating measurements of an independent and dependent variable (represented as an N x 2 array), or a continuous field discretized into voxels in a 3-dimensional volume (represented as an Nx x Ny x Nz array).
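
As a minimal sketch, assuming NumPy is available, the array shapes described above might be set up as follows. The variable names and sizes are illustrative only and are not taken from the workshop.

```python
import numpy as np

n_particles = 5  # hypothetical system size, chosen for illustration

# N x 3 array: one row of (x, y, z) coordinates per particle
positions = np.zeros((n_particles, 3), dtype=float)

# N x 2 array: column 0 is the independent variable, column 1 the dependent one
x = np.linspace(0.0, 1.0, n_particles)
measurements = np.column_stack((x, x**2))

# Nx x Ny x Nz array: a scalar field sampled on a regular grid of voxels
field = np.zeros((4, 3, 2))

print(positions.shape, measurements.shape, field.shape)
```

Because every element has the same type and the shape is fixed, operations on such arrays can be applied to all elements at once rather than in explicit loops.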

Heterogeneous Data

Datasets can be heterogeneous in any number of ways. Some datasets are aggregates of multiple sub-datasets (each of which might be more regular in structure), which are related to each other in meaningful ways. Some represent data that are defined on an irregular or complex geometry, such as an unstructured mesh, an underlying network, or an inhomogeneous sampling of points in space. Some are regularly structured tables of observations in which the variables being observed are of different types or structures (see Tabular Data below for a fuller discussion). As a result, there is no "one size fits all" data structure for capturing this diversity of structures, although there are commonly recurring types of structures (hierarchical collections, meshes, networks, tables, etc.) and associated tools for managing them. Identifying the types of heterogeneity in a dataset, and the interactions among its elements, is a first step in pinpointing useful tools for data processing.

Tabular Data

An important class of heterogeneous data, alluded to above, is tabular data. These are two-dimensional datasets of the sort found in spreadsheets, with a set of rows and columns. Data in any particular column are generally of the same type, but columns of different types might be mixed together in a data table. The presence of row and/or column labels can enable various data queries based on the identity of those labels. While in principle a spreadsheet or data table can have very little organization or structure, with any data type in any cell, various conceptual and computational efficiencies are gained if one imposes structure. In particular, a canonical form of tabular data -- known in some circles as "tidy" data -- imposes a particular type of structure that is often useful.
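
One way to illustrate the idea of imposing structure is the sketch below, which assumes the pandas library and uses a small made-up table. It reshapes a "wide" table into a tidy one, taking "tidy" in the common sense of one observation per row and one variable per column. The column names and values are hypothetical.

```python
import pandas as pd

# A "wide" table: one row per site, one column per year (hypothetical values)
wide = pd.DataFrame({
    "site": ["A", "B"],
    "2020": [1.5, 2.0],
    "2021": [1.7, 2.4],
})

# Tidy form: each row is one observation, each column is one variable
tidy = wide.melt(id_vars="site", var_name="year", value_name="value")

# Column labels enable queries based on the identity of those labels
site_a = tidy[tidy["site"] == "A"]
print(tidy)
```

In the tidy form, each column still holds a single type (text for site and year, floating point for value), which matches the column-wise typing described above.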

Relational Data

Relational data, at least informally here, are the sort found in relational databases, involving a collection of tables (as in tabular data, above) that are linked to each other through one or more shared keys in the tables.
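
A small sketch of tables linked by a shared key, using the sqlite3 module from the Python standard library, might look like the following. The table names, column names, and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.executescript("""
    CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, site TEXT);
    CREATE TABLE readings (sample_id INTEGER, value REAL);
    INSERT INTO samples VALUES (1, 'A'), (2, 'B');
    INSERT INTO readings VALUES (1, 0.5), (1, 0.7), (2, 1.2);
""")

# Link the two tables through the shared key sample_id
rows = conn.execute("""
    SELECT samples.site, readings.value
    FROM samples JOIN readings ON samples.sample_id = readings.sample_id
    ORDER BY samples.site, readings.value
""").fetchall()
conn.close()

print(rows)
```

Each reading stores only the key of its sample, so information about a sample is kept in one place and combined with the readings only when a query asks for it.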

Network Data

Networks (or graphs) are mathematical objects in which a set of nodes (or vertices) are connected by a set of edges (or links). As such, they are broadly useful in encoding different types of associations between different entities, where the associations are represented by network edges and the entities are represented by the nodes. The field of "network science" has emerged in part because of the availability of rich datasets on a wide variety of different types of networks (social, biological, technological), and in part because examination of the structure of disparate networks has revealed some common themes across networks that help to provide insights into the relationship between structure and function.
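
As a rough sketch using only built-in Python types, a small undirected network can be stored as a mapping from each node to the set of its neighbors. The node names below are hypothetical, and dedicated graph libraries offer much richer functionality than this.

```python
# Undirected edges between hypothetical entities
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("cat", "dan")]

# Adjacency structure: each node maps to the set of its neighbors
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

# Degree (number of neighbors) is one simple structural property of a node
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
print(degree)
```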

"Big" Data

A variety of technologies have made many large datasets increasingly available, and these hold the potential for new and powerful analyses of complex phenomena. Data can be "big" in various ways, but an important question is: "big compared to what?" Datasets might be large compared to what can fit in memory on the machine you are running on, which might suggest either that you find a bigger machine or that you process the data in smaller chunks. Datasets might be sufficiently large that moving them around among different acquisition, storage, and computing resources imposes a substantial burden or delay. Or analyses on large datasets might take substantially longer than for smaller datasets (especially if the underlying algorithms you are using have run times that scale superlinearly with the problem size), in which case you might want to investigate parallel computing resources. For some, "Big Data" implies an ecosystem of algorithms, software, and hardware resources intended to facilitate certain types of computations on large datasets (e.g., MapReduce). If you are working with tabular data, your dataset might be "big" in terms of the number of rows (e.g., the number of records or observations), the number of columns (e.g., the number of variables per observation), or both. But "Big Data" are often not as big as they seem: even if they are embedded in some high-dimensional space, data often lie on lower-dimensional subspaces or manifolds that reflect important relationships among elements and that can be identified with dimensionality reduction techniques.
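
The option of processing data in smaller chunks can be sketched as follows. This is an illustration only: the generator stands in for a dataset too large to load at once, and the helper name chunks and the chunk size are arbitrary choices rather than part of any particular library.

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from an iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Stand-in for a dataset too large to hold in memory at once
values = (float(i) for i in range(1, 1_000_001))

total, count = 0.0, 0
for chunk in chunks(values, 10_000):
    # Only one chunk is resident in memory at a time
    total += sum(chunk)
    count += len(chunk)

mean = total / count
print(mean)
```

This pattern works for any statistic that can be accumulated incrementally, such as sums, counts, minima, and maxima. Statistics that need all of the data at once, such as an exact median, require a different approach.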

Image Data

Image data are becoming more common in data science applications. There is a wide range of image formats, resolutions, and color depths to work with. Images represent a particular class of regular data, based on a "raster" format of rows and columns of pixels, with numerical values associated with one or more channels (e.g., 8-bit grayscale, or 32-bit RGBA with red, green, blue, and alpha channels). These underlying numerical values (often represented with arrays) enable a variety of image processing tasks, such as edge detection, de-blurring, pattern recognition, compression, noise reduction, segmentation, differencing, histograms, and more. Many image processing libraries exist to support these standard sorts of operations, but access to the lower-level numerical representations allows for the construction of custom image processing algorithms and pipelines as well.
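
To make the array view of an image concrete, here is a minimal sketch assuming NumPy. It builds a tiny synthetic grayscale image and locates a vertical edge by differencing neighboring columns. The image contents are made up, and this simple differencing is only a crude stand-in for the edge detection methods found in image processing libraries.

```python
import numpy as np

# Hypothetical 4 x 6 8-bit grayscale image: dark left half, bright right half
image = np.zeros((4, 6), dtype=np.uint8)
image[:, 3:] = 200

# A 32-bit RGBA image of the same size has four 8-bit channels per pixel
rgba = np.zeros((4, 6, 4), dtype=np.uint8)

# Differencing neighboring columns highlights vertical edges.
# Cast to a signed type first so the subtraction cannot wrap around.
diff = np.abs(np.diff(image.astype(np.int16), axis=1))
edge_columns = np.flatnonzero(diff.max(axis=0))

print(image.shape, rgba.shape, edge_columns)
```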

Textual Data

Text has traditionally served as an important conduit for metadata (that is, data describing the primary data of interest), but the availability of text from many sources has increasingly made text a primary data source in its own right. This includes textual data in books, scientific papers, news articles, social media, historical archives, government and NGO documents, etc. Computational linguists work with large text corpora that might consist of terabytes of data, and many tools and libraries implementing sophisticated methods in Natural Language Processing (NLP) have been developed to support that work.
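
A very simple first step in treating text as data is to split it into word tokens and count them. The sketch below uses only the Python standard library and a made-up two-sentence text. Real NLP libraries handle tokenization far more carefully than this regular expression does.

```python
import re
from collections import Counter

# Hypothetical two-sentence "corpus"
text = "Data come in many forms. Text data are one of those forms."

# Lowercase the text and split it into alphabetic word tokens
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

print(counts.most_common(2))
```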

Streaming Data

"Streaming" refers less to a particular form of data than to the state in which the data are managed: they arrive continuously and are handled as they arrive. Streaming data can take a variety of forms, such as text, numbers, or images. Twitter data are an example of streaming data and can be accessed with the Twitter Streaming API. Weather data often stream live from distributed weather sensor networks all over the world. Streaming data are often so voluminous that they are processed and filtered before anything is saved, and the original data are then discarded.
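
The process-and-filter pattern described above can be sketched with a Python generator. The function sensor_stream is a hypothetical stand-in for a live feed (it just produces random numbers), and the threshold of 30.0 is an arbitrary choice made for illustration.

```python
import random

def sensor_stream(n, seed=0):
    """Hypothetical stand-in for a live feed: yields one reading at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.gauss(20.0, 5.0)  # e.g., a temperature in degrees Celsius

kept = []                     # only the filtered readings are saved
seen = 0
running_max = float("-inf")
for reading in sensor_stream(1000):
    seen += 1
    running_max = max(running_max, reading)  # summary updated on the fly
    if reading > 30.0:                       # filter before saving
        kept.append(reading)
    # readings that fail the filter are not stored anywhere

print(seen, len(kept))
```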

Hierarchical and Aggregated Data

Complex data structures and data formats often aggregate many related pieces of information, and are often organized hierarchically, with subsets of data nested within higher levels. A variety of formats support these types of data descriptions. JSON and XML are widely used examples, with uses extending well beyond data science and scientific computing. Uncompressed, these text formats take up more space than binary formats, but they offer the advantage of conforming to a defined format that supports standardization and access across different analysis tools. More targeted to large scientific datasets are HDF5 and NetCDF, two data formats (and associated libraries) that support the hierarchical aggregation of array-oriented data.
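
As a small illustration of hierarchical nesting, the sketch below writes a made-up nested record to JSON text and reads it back with the json module from the Python standard library. The record structure and field names are hypothetical.

```python
import json

# Hypothetical nested record: an experiment containing runs, each with readings
experiment = {
    "name": "demo",
    "runs": [
        {"id": 1, "readings": [0.5, 0.7]},
        {"id": 2, "readings": [1.2]},
    ],
}

text = json.dumps(experiment, indent=2)   # serialize to JSON text
restored = json.loads(text)               # parse it back into Python objects

# Navigate the hierarchy by key and index
second_run_first_reading = restored["runs"][1]["readings"][0]
print(second_run_first_reading)
```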
