Our first set of practical examples will use the Baseball Databank, a compilation of historical Major League Baseball (MLB) data maintained by the Chadwick Baseball Bureau and based on earlier work by Sean Forman, Sean Lahman, and others. The Baseball Databank, along with other relevant open baseball datasets, is described and linked to here, with the current version contained in the baseballdatabank git repository.
The previous page explained how to clone the baseballdatabank repository as a submodule of our tutorial repository. Alternatively, if you just want the baseball data, you can clone that repository directly.
Once you have downloaded the baseball data, you will find that it contains two folders, 'core' and 'upstream'. The 'core' folder contains the databank itself, and 'upstream' contains files used to construct the databank. The databank in 'core' is a long list of data files in Comma-Separated Values (CSV) format, a common way of storing tabular data in a textual "flat file". We list the files below:
- AllstarFull.csv
- Appearances.csv
- AwardsManagers.csv
- AwardsPlayers.csv
- AwardsShareManagers.csv
- AwardsSharePlayers.csv
- Batting.csv
- BattingPost.csv
- CollegePlaying.csv
- Fielding.csv
- FieldingOF.csv
- FieldingOFsplit.csv
- FieldingPost.csv
- HallOfFame.csv
- HomeGames.csv
- Managers.csv
- ManagersHalf.csv
- Parks.csv
- People.csv
- Pitching.csv
- PitchingPost.csv
- Salaries.csv
- Schools.csv
- SeriesPost.csv
- Teams.csv
- TeamsFranchises.csv
- TeamsHalf.csv
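To check which of these files a local copy actually contains, a listing along the following lines may help. This is only a minimal sketch: the path `baseballdatabank/core` is an assumption about where the repository was cloned, and `list_csv_files` is a hypothetical helper name introduced here for illustration.

```python
from pathlib import Path


def list_csv_files(core_dir):
    """Return the sorted names of the CSV files directly inside core_dir."""
    return sorted(p.name for p in Path(core_dir).glob("*.csv"))


# Assumed location of the cloned repository; adjust to match your checkout.
core = Path("baseballdatabank") / "core"

if core.is_dir():
    for name in list_csv_files(core):
        print(name)
else:
    print(f"{core} not found; clone the repository or change the path")
```

If the printed names differ from the list above, the most likely explanation is that the repository has been updated since this page was written.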
The file Batting.csv, for example, contains batting records for all MLB players throughout history (even for those players, like "Moonlight" Graham, who never came to bat). The first few lines of the file reveal the structure of the data, showing the aggregated batting statistics for each player in each season that they played:
playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,0
addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,0
allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,1
allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,0
ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,0
armstbo01,1871,1,FW1,NA,12,49,9,11,2,1,0,5,0,1,0,1,,,,,0
barkeal01,1871,1,RC1,NA,1,4,0,1,0,0,0,2,0,0,1,0,,,,,0
barnero01,1871,1,BS1,NA,31,157,66,63,10,9,0,34,11,6,13,1,,,,,1
barrebi01,1871,1,FW1,NA,1,5,1,1,1,0,0,1,0,0,0,0,,,,,0
The first line of the file is a header naming the columns. The first entry in each subsequent row, the playerID, identifies a player; the same playerID recurs in other CSV files to identify that player, with biographical information about each player in the file People.csv. Most of the remaining columns describe different hitting statistics associated with that player during that year.
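As a quick illustration of this structure, the sketch below parses the header and three of the rows shown above using only Python's standard `csv` module. The rows are copied from the sample rather than read from disk, so nothing here depends on where the data were downloaded; the batting-average calculation (hits divided by at-bats) is included only as an example of working with the columns.

```python
import csv
import io

# Header and three of the rows shown above, copied from Batting.csv.
sample = """\
playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,0
addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,0
ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,0
"""

# DictReader uses the header line as the keys for every row.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    at_bats = int(row["AB"])
    hits = int(row["H"])
    # Guard against division by zero for players with no at-bats.
    average = hits / at_bats if at_bats else None
    # Fields left empty in the file come back as empty strings.
    missing = [name for name, value in row.items() if value == ""]
    print(row["playerID"], row["yearID"], average, missing)
```

Two details are worth noticing. Every value arrives as a string, so numeric columns need an explicit conversion such as `int(...)`. Also, the `lgID` value `NA` in these rows is a league code rather than a missing value; the `csv` module keeps it as the string `"NA"`, but tools that treat `NA` as a missing-value marker by default may need that behavior adjusted.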
We will revisit these data later, using various Python packages for data interrogation and analysis.
In addition to the Baseball Databank, another extremely valuable resource for baseball data is Retrosheet, which is also accessible in the retrosheet git repository maintained by the Chadwick Bureau. Retrosheet provides an even finer-grained dataset capturing baseball history: each game played is recorded as a row in a CSV file containing 161 columns that characterize various attributes of the game.