Data Access and Input

Chris Myers (CAC), Jeff Sale (SDSC)
Cornell Center for Advanced Computing and San Diego Supercomputer Center

Revisions: 6/2023, 2/2020 (original)

In most cases, the data you want to analyze live somewhere outside of your program, and you need to read them in. Ideally the data are in a standardized format, so that you do not have to write a custom parser to read them, or are at least structured enough that such a parser is far smaller than the dataset it is intended to read. We will not cover lower-level Python I/O here (e.g., file reads and writes), but focus instead on a few common scenarios that one might encounter in accessing and importing data:

Reading structured data (e.g., arrays and tables) from files and spreadsheets
Reading data tables from SQL databases
Accessing data through an API
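
As a rough preview of these three scenarios, the following minimal sketch reads the same small table three ways. It is illustrative only: the table contents, the column names, the `readings` table name, and the JSON payload are all made up for this example, an in-memory text buffer stands in for a file on disk, an in-memory SQLite database stands in for a real database server, and a hard-coded JSON string stands in for the response an API might return.

```python
import io
import json
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical data: a small CSV table held in memory instead of a file on disk.
csv_text = (
    "sample,temperature,pressure\n"
    "a,20.5,101.3\n"
    "b,21.0,100.9\n"
    "c,19.8,101.1\n"
)

# 1. Structured data from a file.
# A table with mixed column types maps naturally onto a Pandas DataFrame.
table = pd.read_csv(io.StringIO(csv_text))
# The purely numeric columns can also be read directly into a NumPy array.
array = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=(1, 2))

# 2. A data table from a SQL database.
# Store the table in an in-memory SQLite database, then query it back via Pandas.
conn = sqlite3.connect(":memory:")
table.to_sql("readings", conn, index=False)
warm = pd.read_sql_query(
    "SELECT sample, temperature FROM readings "
    "WHERE temperature > 20 ORDER BY sample",
    conn,
)
conn.close()

# 3. Data from an API.
# Many web APIs return JSON; this string stands in for such a response body.
payload = '[{"sample": "a", "temperature": 20.5}, {"sample": "b", "temperature": 21.0}]'
from_api = pd.DataFrame(json.loads(payload))

print(table)
print(array.shape)
print(warm)
print(from_api)
```

In practice, the file path, the database connection, and the API request would each point at a real external source, but the resulting objects (a NumPy array or a Pandas DataFrame) are handled the same way once the data are loaded.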

Objectives

After you complete this segment, you should be able to:

Understand the structure of NumPy arrays and Pandas DataFrames
Use I/O functions to read data into NumPy arrays and Pandas DataFrames
Describe how to interface with a SQL database via Pandas
Discuss different avenues for accessing data, such as files, databases, and APIs

Prerequisites

This tutorial assumes the reader has some working knowledge of general programming concepts, even if not directly with the Python programming language. The target audience is scientists and engineers who are already programming in Python and are interested in using Python tools and packages to analyze datasets. Readers who need additional introductory material about Python can consult An Introduction to Python as well as the documentation on the python.org website.