Preparing & Processing Data

Chris Myers (CAC), Jeff Sale (SDSC)
Cornell Center for Advanced Computing and San Diego Supercomputer Center

Revisions: 6/2023, 2/2020 (original)

Even after you've imported data into the language or analysis system you are working with, the data generally needs some cleanup and preprocessing before it is ready for further analysis. Several types of data processing recur across many different contexts, and we address them in the following pages:

Cleaning data
Extracting and reorganizing data
Augmenting data, such as adding derived data based on the raw data imported, or integrating multiple related datasets
Applying aggregate operations on data and/or grouped subsets of data
Filtering subsets of data based on various criteria
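
As a rough preview of what several of these operations look like in practice, here is a minimal sketch using Pandas. The dataset, the column names (`site`, `temp_c`, `temp_f`), and the threshold are hypothetical and chosen only for illustration; the following pages cover each operation, including extracting and reorganizing data, in more detail.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset (not from the workshop): temperature readings by site.
df = pd.DataFrame({
    "site": ["A", "A", "B", "B", "B"],
    "temp_c": [20.0, np.nan, 25.0, 27.0, 26.0],
})

# Cleaning: drop rows where the temperature is missing.
clean = df.dropna(subset=["temp_c"]).copy()

# Augmenting: add a derived column computed from the raw data.
clean["temp_f"] = clean["temp_c"] * 9 / 5 + 32

# Aggregating over grouped subsets: mean temperature per site.
site_means = clean.groupby("site")["temp_c"].mean()

# Filtering: keep only rows warmer than 25 degrees Celsius.
warm = clean[clean["temp_c"] > 25.0]
```

Dropping rows is only one way to handle missing values; depending on the analysis, filling them in (for example with `fillna`) may be more appropriate.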

Objectives

After you complete this segment, you should be able to:

Identify recurring aspects of problems with data, and how to address them
Use some of the tools for data cleaning within Pandas
Explain how to manage missing data values
Extract and reorganize data to get into an appropriate form for analysis
Apply aggregate operations to data in an array or dataframe
Demonstrate the split-apply-combine approach using groupby methods
Filter data based on various conditions

Prerequisites

This tutorial assumes the reader has some working knowledge of general programming concepts, even if not directly with the Python programming language. The target audience is scientists and engineers who are already programming in Python and are interested in using Python tools and packages to analyze datasets. Readers who need additional introductory material about Python can consult An Introduction to Python as well as the documentation on the python.org website.