Big Data
A variety of technologies have made many large datasets increasingly available, and these hold the potential for new and powerful analyses of complex phenomena. Data can be "big" in various ways, but an important question is: "big compared to what?" Datasets might be large compared to what can fit in memory on the machine you are running on, which might suggest either finding a bigger machine or processing the data in smaller chunks. Datasets might be sufficiently large that moving them around among different acquisition, storage, and computing resources imposes a substantial burden or delay. Or analyses on large datasets might take substantially longer than they would on smaller ones (especially if the underlying algorithms have run times that scale superlinearly with problem size), in which case you might want to investigate parallel computing resources or consider alternative algorithms that scale better. "Big Data" are often not as big as they seem, however: even if they are embedded in some high-dimensional space, data often lie on lower-dimensional subspaces or manifolds that reflect important relationships among elements and that can be identified with dimensionality reduction techniques.
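As a minimal illustration of that last point, the following sketch (using synthetic data and the scikit-learn and NumPy libraries, not any particular application) shows how principal component analysis can reveal that nominally high-dimensional data are concentrated near a much lower-dimensional subspace:

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic example: 10,000 points in 100 dimensions that in fact lie
    # near a 3-dimensional subspace, plus a small amount of noise.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(10_000, 3))    # the "true" low-dimensional signal
    mixing = rng.normal(size=(3, 100))       # embed it in 100 dimensions
    X = latent @ mixing + 0.01 * rng.normal(size=(10_000, 100))

    # Fit PCA and inspect how much variance each component explains.
    pca = PCA(n_components=10).fit(X)
    print(pca.explained_variance_ratio_.round(3))
    # The first three components account for essentially all of the variance,
    # exposing the low-dimensional structure hidden in the 100-dimensional data.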
For some, "Big Data" implies an ecosystem of algorithms, software, and hardware resources intended to facilitate certain types of computations on very large datasets. Hadoop and Spark are two such widely used ecosystems. The incorporation of hardware resources, such as clusters and distributed filesystems, makes these environments different from, and more complicated than, the software libraries we have focused on here. Hadoop is written in Java, although a number of tools have been developed to provide access to various parts of the Hadoop ecosystem from Python. Spark is written in Scala and, like Hadoop, runs on the Java Virtual Machine; the PySpark interface provides a very effective framework for carrying out Spark-based analyses in Python.
Spark centers on the concept of Resilient Distributed Datasets (RDDs), which are collections of elements partitioned across the nodes of a cluster that can be operated on in parallel. At their core, RDDs store and support operations on key-value pairs, similar to Python's built-in dictionaries (a short example appears below). Every Spark application consists of a driver program that runs the user's main function on a cluster, and thus PySpark serves as an alternative Python interpreter that is Spark-aware (i.e., one would not just start up the default python interpreter and type "import pyspark"). An overview of PySpark and RDDs is beyond the scope of this tutorial, but fortunately the regular HPC Workshop on Machine Learning and BIG DATA hosted by the Pittsburgh Supercomputing Center (PSC) provides an excellent introduction to both, along with hands-on exercises that demonstrate how the system works in practice. (Note: the link above is to a recent version of the workshop; go to the High Performance Computing Workshop Series page to find out about upcoming events.)
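To make the key-value flavor of RDD operations concrete, here is a minimal word-count sketch in PySpark (assuming a working Spark installation; the application name and input strings are arbitrary, and the script would be launched with spark-submit or entered in the pyspark shell rather than the default Python interpreter):

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCountSketch")

    # Distribute a small in-memory collection across the cluster as an RDD.
    lines = sc.parallelize(["big data", "big analyses", "data analyses"])

    # Build (key, value) pairs and combine the values for each key in parallel.
    counts = (lines.flatMap(lambda line: line.split())  # one word per element
                   .map(lambda word: (word, 1))         # (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))    # sum counts per word

    print(counts.collect())   # e.g., [('big', 2), ('data', 2), ('analyses', 2)]

    sc.stop()

The transformations (flatMap, map, reduceByKey) are carried out in parallel across the partitions of the RDD, and only the final collect() gathers results back to the driver program.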