Augmenting Data
Combining datasets
Datasets sometimes come from different sources, but can be combined in useful ways to produce larger datasets. Pandas provides substantial functionality to merge, join, and concatenate dataframes, similar to the join operations provided by relational databases. We will consider concatenation here.
As was mentioned previously, the CAL FIRE data contains multiple worksheets, including information about fires in 2017 that were not included in the aggregated historical data in the 2016 sheet. We can read in that additional 2017 data and concatenate it with the 2016 data to build a complete dataset. In doing so, however, we'll have to do a bit more data cleaning.
First, let's read in the 2017 sheet, and compare the column headings for the 2016 and 2017 dataframes. We're assuming that the columns in each dataframe will be the same, so that we can concatenate along the row-wise dimension (axis=0).
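A minimal sketch of how this might look is shown below; the file name (calfire_incidents.xlsx) and the sheet names are placeholders, and the dataframe names df16 and df17 are assumptions carried through the rest of this section:

```python
import pandas as pd

# Hypothetical file and sheet names -- substitute whatever was used when
# the 2016 data was first read in.
df16 = pd.read_excel('calfire_incidents.xlsx', sheet_name='2016')
df17 = pd.read_excel('calfire_incidents.xlsx', sheet_name='2017')

# Compare the column headings of the two dataframes.
print(df16.columns)
print(df17.columns)
```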
Comparing the printed column names, we can see that they are almost identical, but the first column in df17 is "YEAR_" while the first in the 2016 dataframe is "YEAR". In order to be able to concatenate these dataframes smoothly, we'll want to rename that column in df17, which we can do with the rename method on the dataframe.
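A one-line sketch of that renaming step, assuming the 2017 dataframe is named df17:

```python
# Rename the mismatched column in place so that both dataframes share
# exactly the same column names.
df17.rename(columns={'YEAR_': 'YEAR'}, inplace=True)
```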
Having renamed the column (which we did in place, i.e., modifying the dataframe directly), we can proceed with concatenation to produce a new dataframe containing all the data, which we will use for subsequent analyses. Note that since we are concatenating to add additional rows to our initial dataframe, we'll want to do so along axis=0:
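A sketch of the concatenation, again assuming the two dataframes are named df16 and df17 (the combined name df_fires is likewise just a placeholder); passing ignore_index=True is an optional choice that gives the combined dataframe a fresh, continuous row index:

```python
# Stack the 2017 rows beneath the 2016 rows (axis=0) to build the
# complete dataset used in the analyses that follow.
df_fires = pd.concat([df16, df17], axis=0, ignore_index=True)
```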
Adding derived data
Sometimes the raw dataset we are working with contains the basic information we need to carry out an analysis of interest, but not in a form that is immediately useful to us. In this case, we might want to augment the dataset with new data elements derived from the raw data. Because pandas dataframes are easily extensible, those new data elements can be added very succinctly. We will illustrate this in the context of the baseball and wildfire datasets.
The baseball batting dataset contains a lot of information about batting, but oddly it does not contain information about the number of singles (i.e., one-base hits in which the batter reached first base safely and remained there until the next batter batted). In baseball, there are 4 kinds of hits:
- singles (one base, or 1B)
- doubles (two bases, or 2B)
- triples (three bases, or 3B)
- home runs (four bases, or HR)
Since the dataset contains information about the total number of hits H, as well as the number of 2B, 3B and HR, we can define an additional column in both the batting and teams dataframes to compute the number of singles (1B), which is just H-(2B+3B+HR). A peek into one of the dataframes will reveal that the 1B data has been added as the last column.
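A sketch of that computation, assuming the dataframes are named batting and teams and that the hit counts are stored under the column names H, 2B, 3B, and HR:

```python
# Singles = total hits minus doubles, triples, and home runs.
batting['1B'] = batting['H'] - (batting['2B'] + batting['3B'] + batting['HR'])
teams['1B'] = teams['H'] - (teams['2B'] + teams['3B'] + teams['HR'])

# A quick peek confirms that 1B now appears as the last column.
batting.head()
```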
Exercise on data augmentation: The Slash Line
Now that you know how to add additional data columns to your dataframe, perhaps you'd like to get some hands-on experience. If so, you can proceed to the Baseball Jupyter notebook [LINK] and work through the exercise entitled The Slash Line. This exercise enables you to compute additional baseball statistics from the raw data in the batting dataframe. The "Slash Line", the trio of batting statistics known as Batting Average (BA), On-Base Percentage (OBP), and Slugging Percentage (SLG), is central to modern baseball data analytics, and can be easily calculated from the data we have at hand. Once you're done with that, come back here and rejoin this discussion.
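If you'd like to check your work afterwards, the standard formulas are sketched below. The column names AB, BB, HBP, and SF (at-bats, walks, hit-by-pitch, and sacrifice flies) are assumed to follow the same conventions as the rest of the batting dataframe, and the notebook exercise may phrase things slightly differently:

```python
# Total bases: singles count once, doubles twice, triples three times,
# and home runs four times.
tb = (batting['1B'] + 2 * batting['2B']
      + 3 * batting['3B'] + 4 * batting['HR'])

# Batting Average, On-Base Percentage, and Slugging Percentage.
batting['BA'] = batting['H'] / batting['AB']
batting['OBP'] = ((batting['H'] + batting['BB'] + batting['HBP'])
                  / (batting['AB'] + batting['BB'] + batting['HBP'] + batting['SF']))
batting['SLG'] = tb / batting['AB']
```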