Our first set of practical examples will use the Baseball Databank, a compilation of historical Major League Baseball (MLB) data maintained by the Chadwick Baseball Bureau, based upon earlier work by Sean Forman, Sean Lahman, and others. The Baseball Databank, along with other relevant open baseball datasets, can be found in the baseballdatabank git repository. As described in that repository, the data as provided at that site constitute a legacy resource, useful for casual or exploratory use, or as a convenient dataset for students and researchers to practice their data science skills. If you need to delve more deeply into data maintained by the Chadwick Baseball Bureau for serious research or publication purposes, you should contact them for further information.

Note also that the data in the Baseball Databank are updated after the completion of every MLB season; as of this writing, the most recent data included are from the 2022 season. Some of the static graphics and narrative text in this tutorial, however, were developed with data up through the 2019 season. If you run the code in the associated Jupyter notebook, you will presumably be using the most recent version of the data, so you might notice discrepancies between the specific numbers reported in this tutorial and those in your own analysis outputs.

The previous page explained how to clone the baseballdatabank repository as a submodule of our tutorial repository. Alternatively, if you just want the baseball data, you can clone that repository directly.

Once you have downloaded the baseball data, you will find it contains three folders: 'core', 'upstream', and 'contrib'. We will be working with data in both the 'core' and 'contrib' folders. Those folders contain a long list of data files in a Comma-Separated Values (CSV) format, a common way of storing tabular data in a textual "flat file". We list the files below:

core
  • AllstarFull.csv
  • Appearances.csv
  • Batting.csv
  • BattingPost.csv
  • Fielding.csv
  • FieldingOF.csv
  • FieldingOFsplit.csv
  • FieldingPost.csv
  • HomeGames.csv
  • Managers.csv
  • ManagersHalf.csv
  • Parks.csv
  • People.csv
  • Pitching.csv
  • PitchingPost.csv
  • SeriesPost.csv
  • Teams.csv
  • TeamsFranchises.csv
  • TeamsHalf.csv
contrib
  • AwardsManagers.csv
  • AwardsPlayers.csv
  • AwardsShareManagers.csv
  • AwardsSharePlayers.csv
  • CollegePlaying.csv
  • HallOfFame.csv
  • Salaries.csv
  • Schools.csv
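
If you would like to check that your local copy contains the files listed above, a minimal sketch along the following lines may help. The directory name "baseballdatabank" and the helper name list_csv_files are assumptions made for illustration; adjust the path to wherever you cloned the data.

```python
from pathlib import Path


def list_csv_files(root, folders=("core", "contrib")):
    """Return a dict mapping each folder name to its sorted CSV file names."""
    root = Path(root)
    listing = {}
    for folder in folders:
        directory = root / folder
        if directory.is_dir():
            # Keep only files ending in .csv, sorted by name
            listing[folder] = sorted(p.name for p in directory.glob("*.csv"))
        else:
            # A missing folder is reported as an empty list
            listing[folder] = []
    return listing


if __name__ == "__main__":
    # "baseballdatabank" is an assumed location for the cloned repository
    for folder, names in list_csv_files("baseballdatabank").items():
        print(folder)
        for name in names:
            print("  ", name)
```

If a folder comes back empty, the most likely cause is that the path does not point at the top level of the cloned repository.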

The file Batting.csv, for example, contains batting records for all MLB players throughout history (even for those players, like "Moonlight" Graham, who never came to bat). The first few lines of the file reveal the structure of the data, showing the aggregated batting statistics for each player in each season that they played:

playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,0
addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,0
allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,1
allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,0
ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,0
armstbo01,1871,1,FW1,NA,12,49,9,11,2,1,0,5,0,1,0,1,,,,,0
barkeal01,1871,1,RC1,NA,1,4,0,1,0,0,0,2,0,0,1,0,,,,,0
barnero01,1871,1,BS1,NA,31,157,66,63,10,9,0,34,11,6,13,1,,,,,1
barrebi01,1871,1,FW1,NA,1,5,1,1,1,0,0,1,0,0,0,0,,,,,0

The first column, playerID, identifies a player; the same playerID recurs in other CSV files to identify that player, and biographical information about each player is in the file People.csv. Most of the remaining columns describe different hitting statistics for that player during that year.
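
To make the column structure concrete, here is a minimal sketch that parses the header and the first three records of the excerpt above using only Python's standard csv module. The function name read_batting and the choice to represent blank fields as None are assumptions made for illustration, not part of the Baseball Databank or of the analysis later in this tutorial.

```python
import csv
import io

# Header and first three records, copied from the Batting.csv excerpt
SAMPLE = """\
playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,0
addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,0
allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,1
"""

# Columns holding identifiers rather than counts
TEXT_COLUMNS = {"playerID", "teamID", "lgID"}


def read_batting(stream):
    """Parse batting records into a list of dicts keyed by column name."""
    records = []
    for row in csv.DictReader(stream):
        record = {}
        for name, value in row.items():
            if name in TEXT_COLUMNS:
                record[name] = value
            else:
                # Blank fields (e.g. IBB in 1871) become None (our assumption)
                record[name] = int(value) if value else None
        records.append(record)
    return records


records = read_batting(io.StringIO(SAMPLE))
first = records[0]
print(first["playerID"], first["yearID"], first["AB"], first["IBB"])
```

To read the full file instead of the sample, you could pass an open file handle for Batting.csv in place of the io.StringIO object.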

We will revisit these data later, using various Python packages for data processing and analysis.

In addition to the Baseball Databank, another extremely valuable resource for baseball data is Retrosheet, which is also accessible in the retrosheet git repository maintained by the Chadwick Bureau. Retrosheet provides an even finer-grained dataset capturing baseball history, with each game played recorded as a row in a CSV file containing 161 columns characterizing various game attributes. Even for the more casual fan who is not interested in a deep dive into baseball statistics, Retrosheet can be a very useful resource for figuring out which game they attended or watched sometime in the past.

 
© Cornell University | Center for Advanced Computing