Filtering Data
Even after you have cleaned and preprocessed your data in various ways, you might want to filter it further, perhaps to extract a subset of the data for some specific processing. In earlier material on arrays and dataframes, we discussed the process of selecting a subset of data by indexing and slicing; links to the relevant online documentation are included below.
Selecting a Subset of Data Based on a Logical Condition
Another common mode of operation across different Python functions and packages is the ability to select a subset of data in a container based upon some specified logical condition (i.e., a test or function that returns either True or False). These include the following; a short sketch illustrating these mechanisms appears after the list.
- The built-in Python function filter, which takes a function and an iterable as inputs and returns an iterator yielding those items of the iterable for which the function applied to an item is True.
- Boolean array indexing in NumPy, which supports accessing elements of an array by indexing with an array of Booleans. If, for example, one wanted to select only those elements of a numpy array x which are greater than zero, the expression x[x > 0] will return a new array satisfying that constraint. In this example, the expression x > 0 returns an array of Booleans of the same shape as x, and then x[x > 0] extracts the subset of the array for which x > 0 is True. (Note: if x is multidimensional, then the Boolean array will also be multidimensional, but the extraction of elements that satisfy the test will result in a one-dimensional array.) The documentation describes other NumPy indexing routines.
- The numpy.where function, which provides a mechanism for either identifying those indices in a numpy array where a particular Boolean condition is satisfied, or constructing a new array based on where such a condition is satisfied. See the numpy.where documentation for further information.
- Boolean indexing in pandas, which builds upon the same mechanism described above for numpy. We can select subsets of a dataframe that meet some specified condition, as described in more detail below.
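The following sketch illustrates the first three mechanisms on a small made-up list and array (the variable names and values here are purely for demonstration):

```python
import numpy as np

# Built-in filter: keep only the positive values from a list.
values = [-2, -1, 0, 1, 2, 3]
positives = list(filter(lambda v: v > 0, values))   # [1, 2, 3]

# Boolean array indexing in NumPy: x > 0 produces a Boolean array of the
# same shape as x, and x[x > 0] extracts the elements where it is True.
x = np.array([[-2, 5], [7, -1]])
mask = x > 0             # array([[False,  True], [ True, False]])
selected = x[mask]       # one-dimensional result: array([5, 7])

# numpy.where: either report the indices where a condition holds, or
# build a new array by choosing between two values elementwise.
indices = np.where(x > 0)          # (array([0, 1]), array([1, 0]))
clipped = np.where(x > 0, x, 0)    # negative entries replaced by 0
```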
Filtering the Baseball Dataset
As an example of Boolean indexing in pandas, imagine that we want to identify the top hitters in some category but don't want the results to be contaminated by small sample sizes; that is, we don't want to include players who batted so little that their statistics are not trustworthy. In pandas, we can easily impose a cutoff based on a minimum number of at-bats (AB) by indexing on the condition pl_bat.AB >= min_AB. For example, if we wanted to list the players with the all-time top slugging percentage ('SLG'), we could issue a command like the following:
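The sketch below assumes pl_bat is a per-player batting dataframe with columns AB and SLG; the cutoff of 100 at-bats is just an illustrative choice:

```python
# Keep only players with at least 100 at-bats, then rank the remaining
# players by slugging percentage, highest first.
top_sluggers = pl_bat[pl_bat.AB >= 100].sort_values('SLG', ascending=False)
top_sluggers.head(10)
```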
The expression pl_bat.AB >= 100 creates a pandas Series of Booleans (True or False depending on whether AB >= 100 in a given row); indexing pl_bat with that Series selects only the subset of rows for which the Series is True and returns a new dataframe. On that reduced dataframe, we can call additional methods such as sort_values.
Filtering the Wildfire Dataset
We previously did some preprocessing on the Wildfire dataset to clean up some of the original data and to represent the ALARM_DATE and CONT_DATE columns consistently as datetime objects. Combining that with our skills in data augmentation, we can create a new column representing the overall temporal duration of a fire. Because the CONT_DATE and ALARM_DATE columns both contain datetime64 objects, we can subtract one from the other to obtain a timedelta. From that we can extract the number of days between when the alarm was sounded and when the fire was contained, and inspect the range of those durations (i.e., the minimum and maximum duration):
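A minimal sketch of that computation, assuming the dataframe is named dffires and has datetime64 columns ALARM_DATE and CONT_DATE:

```python
# Subtracting two datetime64 columns yields a timedelta column;
# .dt.days extracts the duration in whole days.
dffires['DURATION'] = (dffires.CONT_DATE - dffires.ALARM_DATE).dt.days

# Inspect the range of durations (minimum and maximum, in days).
print(dffires.DURATION.min(), dffires.DURATION.max())
```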
Oddly, we can see that some of the fires are listed as having negative durations, i.e., they were contained before they were even identified, which is probably the result of data entry errors. If we were to run dffires[dffires.DURATION < 0], we could see the subset of fires for which this is the case (there are 9 in all). We can also see in the results above that one fire is purported to have lasted for over 3000 days (about 9 years), which is probably also due to a data entry error. Depending on how you want to work with this dataset, you might want to try to correct these errors, but often it is easier just to toss them out. We can do so by filtering the dataframe before proceeding with further analysis. The filtering operation below includes only those rows with DURATION >= 0 and DURATION < 3000:
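A sketch of that filtering step, continuing with the DURATION column computed above:

```python
# Keep only fires with plausible durations: contained on or after the
# alarm date, and lasting less than 3000 days.
dffires = dffires[(dffires.DURATION >= 0) & (dffires.DURATION < 3000)]
```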