Chris Myers (CAC), Jeff Sale (SDSC)
Cornell Center for Advanced Computing and San Diego Supercomputer Center

Revisions: 6/2023, 2/2020 (original)

Even after you have imported data into the language or analysis system you are working with, the data usually need some cleanup and preprocessing before further analysis. Certain kinds of data processing recur across many different contexts; the following pages address them:

  • Cleaning data
  • Extracting and reorganizing data
  • Augmenting data, such as adding derived data based on the raw data imported, or integrating multiple related datasets
  • Applying aggregate operations on data and/or grouped subsets of data
  • Filtering subsets of data based on various criteria
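
As a rough preview of the operations listed above, here is a minimal sketch in Pandas. The dataframe, its column names (site, temp_c, n_obs), and its values are hypothetical and exist only for illustration; the later pages treat each step in more detail.

```python
import pandas as pd

# Hypothetical raw data, made up for illustration only.
raw = pd.DataFrame({
    "site": ["A", "A", "B", "B", "B"],
    "temp_c": [20.0, None, 25.0, 30.0, 30.0],
    "n_obs": [4, 6, 10, 2, 2],
})

# Cleaning: drop exact duplicate rows, then rows with a missing temperature.
clean = raw.drop_duplicates().dropna(subset=["temp_c"])

# Augmenting: add a derived column computed from the raw data.
clean = clean.assign(temp_f=clean["temp_c"] * 9 / 5 + 32)

# Extracting and reorganizing: select columns and sort by temperature.
by_temp = clean[["site", "temp_c"]].sort_values("temp_c", ascending=False)

# Aggregating over grouped subsets: mean temperature per site.
mean_by_site = clean.groupby("site")["temp_c"].mean()

# Filtering: keep only the rows that satisfy a condition.
warm = clean[clean["temp_c"] > 22]

print(mean_by_site)
print(warm)
```

Each step returns a new object rather than modifying raw in place, which makes it easier to inspect intermediate results while developing a preprocessing pipeline.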

Objectives

After you complete this segment, you should be able to:

  • Identify recurring aspects of problems with data, and how to address them
  • Use some of the tools for data cleaning within Pandas
  • Explain how to manage missing data values
  • Extract and reorganize data to get into an appropriate form for analysis
  • Apply aggregate operations to data in an array or dataframe
  • Demonstrate how to apply split-apply-combine operations using groupby methods
  • Filter data based on various conditions

Prerequisites

This tutorial assumes that the reader has some working knowledge of general programming concepts, even if not directly in the Python programming language. The target audience is scientists and engineers who already program in Python and want to use Python tools and packages to analyze datasets. Readers who need additional introductory material on Python can consult An Introduction to Python as well as the documentation on the python.org website.

 
© Cornell University | Center for Advanced Computing