Modeling and Statistics

Chris Myers (CAC), Jeff Sale (SDSC)
Cornell Center for Advanced Computing and San Diego Supercomputer Center

Revisions: 6/2023, 1/2021 (original)

Much of data science involves building models, and those models come in many different kinds. Statistics primarily involves building descriptive models of the data themselves. Mechanistic models describe sets of interacting processes, such that data emerge through the collective action of those components. Machine learning models often lie somewhere in the middle, producing descriptions of data that are learned through the organization and parameterization of flexible computational frameworks such as neural networks or decision trees. In this topic, we touch briefly on some tools that support statistical modeling, as well as the integration of data and models that often arises when one is trying to parameterize a mechanistic model. In the following topic, we address some tools that are useful for machine learning.

Objectives

After you complete this segment, you should be able to:

Use pandas to generate descriptive statistics of data
Use statsmodels to build and analyze statistical models of data
Use scipy to integrate models and data by estimating optimal model parameters
Use NetworkX to build and analyze network models from data
Build a simulation of wildfire dynamics to investigate dynamical self-organization
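
As a rough preview of the first and third objectives, the following minimal sketch computes descriptive statistics with pandas and then estimates model parameters with SciPy. The synthetic dataset, the exponential-decay model, and the names decay, amplitude, and rate are assumptions made for illustration only; they are not taken from the workshop materials.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# Synthetic data (an assumption for illustration): exponential decay plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
true_amplitude, true_rate = 2.0, 0.8
y = true_amplitude * np.exp(-true_rate * t) + rng.normal(0.0, 0.02, t.size)

df = pd.DataFrame({"t": t, "y": y})

# Descriptive statistics: count, mean, std, min, quartiles, and max per column.
summary = df.describe()


def decay(t, amplitude, rate):
    """Hypothetical mechanistic model: first-order exponential decay."""
    return amplitude * np.exp(-rate * t)


# Estimate the parameters that best fit the data (nonlinear least squares).
params, covariance = curve_fit(
    decay, df["t"].to_numpy(), df["y"].to_numpy(), p0=(1.0, 1.0)
)
fitted_amplitude, fitted_rate = params

print(summary)
print(f"amplitude = {fitted_amplitude:.3f}, rate = {fitted_rate:.3f}")
```

With low noise, the fitted values should land close to the amplitude and rate used to generate the data. The returned covariance matrix gives a rough sense of the uncertainty in each estimate.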

Prerequisites

This tutorial assumes some working knowledge of general programming concepts, ideally but not necessarily in Python. The primary audience is scientists and engineers who already program in Python and want to use Python tools and packages to analyze datasets. Readers who need additional introductory material about Python can consult An Introduction to Python as well as the documentation on the python.org website.