Data Access and Input
Chris Myers (CAC), Jeff Sale (SDSC)
Cornell Center for Advanced Computing and San Diego Supercomputer Center
Revisions: 6/2023, 2/2020 (original)
In most cases, the data you want to analyze live somewhere outside of your program, and you need to read them in. Ideally the data are in a standardized format, so that you do not have to write a custom parser to read them, or are at least structured regularly enough that such a parser stays small compared with the dataset it is intended to read. We will not cover lower-level Python I/O here (e.g., file reads/writes), but focus instead on a few common scenarios that one might encounter in accessing and importing data:
- Reading structured data (e.g., arrays and tables) from files and spreadsheets
- Reading data tables from SQL databases
- Accessing data through an API
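As a rough preview of these three scenarios, the sketch below reads the same small table in each way. Everything in it is an illustrative assumption rather than part of the tutorial's own material: the table contents, the column names (site, temp_c), and the table name (readings) are made up, an in-memory string stands in for a file on disk, an in-memory SQLite database stands in for a real SQL server, and a hard-coded JSON string stands in for the response an API might return.

```python
import io
import json
import sqlite3

import numpy as np
import pandas as pd

# 1. Structured data from a file. An in-memory string stands in for a CSV
#    file on disk (hypothetical contents).
csv_text = "site,temp_c\nA,20.5\nB,22.0\nC,19.5\n"
table = pd.read_csv(io.StringIO(csv_text))            # whole table as a DataFrame
temps = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                   skiprows=1, usecols=1)             # numeric column as an array

# 2. A data table from a SQL database. An in-memory SQLite database stands in
#    for a real server; "readings" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
table.to_sql("readings", conn, index=False)           # load the sample rows
warm = pd.read_sql("SELECT site, temp_c FROM readings WHERE temp_c > 20", conn)
conn.close()

# 3. Data from an API. Many web APIs return JSON; this hard-coded string
#    stands in for such a response so that no network access is needed.
payload = '[{"site": "A", "temp_c": 20.5}, {"site": "B", "temp_c": 22.0}]'
from_api = pd.DataFrame(json.loads(payload))

print(table)
print(temps)
print(warm)
print(from_api)
```

In each case the result is a NumPy array or a Pandas DataFrame, so the downstream analysis code can look much the same regardless of where the data came from.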
Objectives
After you complete this segment, you should be able to:
- Understand the structure of NumPy arrays and Pandas DataFrames
- Read data from files into NumPy arrays and Pandas DataFrames
- Describe how to interface with a SQL database via Pandas
- Discuss different avenues for accessing data, such as files, databases, and APIs
Prerequisites
This tutorial assumes the reader has some working knowledge of general programming concepts, even if not directly with the Python programming language. The target audience is scientists and engineers who are already programming in Python and are interested in using Python tools and packages to analyze datasets. Readers who need additional introductory material about Python can consult An Introduction to Python as well as the documentation on the python.org website.