Extracting and Reorganizing Data
Sometimes the data you want to analyze comes to you in a format that is not optimal for your needs, or contains additional data fields that you are not interested in analyzing. In such cases, you might want to preprocess the data to get it into a suitable form for further analysis. We illustrate some of these issues here in the context of our Twitter data. As noted previously, the Twitter data files are not stored in our accompanying GitHub repository, but are instead available through separate download links.
To download the Twitter data files, follow the instructions in the README file in the data/twitter subdirectory of our accompanying GitHub repository. The code described in this section is available in the associated Jupyter notebook on extracting Twitter data.
Extracting and Reorganizing Twitter Data
The standard format for results returned from a Twitter API search is a JSON (JavaScript Object Notation) object containing approximately 40 different fields. For our analysis focused on retweets, we are not interested in all of those fields, but only a subset, as described below. Our general strategy will be to use the json and csv modules, both part of the Python Standard Library, to load the initial JSON data from files, extract the fields of interest, and save the reduced dataset in a CSV file for further processing. The json module contains the function json.loads, which parses a JSON string into a Python dictionary. In the code below, we read in all the tweets in a JSON file, extract information of interest from the resulting dictionary, and write that subset of data back out to a CSV file, with one row per tweet. One of the challenges of working with textual data such as that found in tweets is that it can contain newline (\n) and carriage return (\r) characters that complicate text processing; in the code below, we simply replace each of them with a space so that we don't have to deal with them further.
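As a minimal illustration of json.loads and the newline replacement (using a made-up tweet fragment, not real Twitter data):

```python
import json

# A made-up JSON fragment standing in for a real tweet object
raw = '{"id": 12345, "text": "climate\\nchange"}'

tweet = json.loads(raw)          # JSON string -> Python dictionary
text = tweet['text'].replace('\n', ' ').replace('\r', ' ')
print(text)                      # prints: climate change
```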
For this example, we will not be using our full dataset, since it is a violation of the Twitter Terms of Service to share more than 50,000 tweets in a day. However, even though our full dataset contains more than 450,000 tweets, we were able to reduce the total file size from approximately 3.5 GB down to 119 MB by eliminating most of the 40 or so fields that are not of interest to us, thereby adhering to the spirit and intent of the Twitter ToS. For this example, we provide you with two JSON-format files containing original tweet data collected with the Streaming API, to give you at least a sense of how the process works. You should be able to use this code to process your own data.
The fields we chose to retain in the CSV include the following: id, created_at, lang, user screen_name, user created_at, user id, user followers_count, user friends_count, user time_zone, user utc_offset, retweeted_status, retweeted_status id, retweeted_status user screen_name, and retweeted_status user id. The code below reads all of the files in a folder, extracts the data from this subset of fields, and saves the results in CSV format to a file named 'climatechange_tweets_sample.csv'.
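Here is a minimal sketch of that extraction step. It assumes the downloaded files contain one JSON tweet object per line (the usual capture format for the Streaming API) and live in a data/twitter directory; the directory path and the flattened CSV column names are illustrative choices of ours, not fixed by the Twitter API. Note that we flatten the nested retweeted_status object into its id and user sub-fields.

```python
import csv
import json
from pathlib import Path

# Illustrative paths and column names -- adjust to your own setup
JSON_DIR = Path('data/twitter')
CSV_FILE = 'climatechange_tweets_sample.csv'

COLUMNS = ['tweet_id', 'tweet_created_at', 'lang',
           'user_screen_name', 'user_created_at', 'user_id',
           'user_followers_count', 'user_friends_count',
           'user_time_zone', 'user_utc_offset',
           'retweeted_status_id',
           'retweeted_status_user_screen_name',
           'retweeted_status_user_id']

def clean(value):
    """Replace newlines and carriage returns with spaces in string fields."""
    if isinstance(value, str):
        return value.replace('\n', ' ').replace('\r', ' ')
    return value

with open(CSV_FILE, 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=COLUMNS)
    writer.writeheader()
    for json_path in sorted(JSON_DIR.glob('*.json')):
        with open(json_path, encoding='utf-8') as infile:
            for line in infile:                 # one JSON tweet object per line
                tweet = json.loads(line)
                user = tweet['user']
                rt = tweet.get('retweeted_status')  # None for original tweets
                row = {
                    'tweet_id': tweet['id'],
                    'tweet_created_at': tweet['created_at'],
                    'lang': tweet.get('lang'),
                    'user_screen_name': user['screen_name'],
                    'user_created_at': user['created_at'],
                    'user_id': user['id'],
                    'user_followers_count': user['followers_count'],
                    'user_friends_count': user['friends_count'],
                    'user_time_zone': user.get('time_zone'),
                    'user_utc_offset': user.get('utc_offset'),
                    'retweeted_status_id': rt['id'] if rt else None,
                    'retweeted_status_user_screen_name':
                        rt['user']['screen_name'] if rt else None,
                    'retweeted_status_user_id':
                        rt['user']['id'] if rt else None,
                }
                writer.writerow({k: clean(v) for k, v in row.items()})
```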
Tweaking the CSV File to Accommodate Timestamps
The CSV file we just wrote is a much smaller distillation of the data of interest to us, but we'll want to tweak it a little bit so that we can work with it in pandas more easily.
Two of the fields in the CSV file represent timestamps: tweet_created_at and user_created_at. The pd.read_csv function can be passed an additional argument indicating that particular columns should be parsed as dates (datetimes) instead of plain strings. Unfortunately, the timestamps we extracted from the Twitter JSON data are not in a standard datetime format, e.g., Thu Nov 29 19:22:55 +0000 2018. Pandas has a function named pd.to_datetime that can not only convert strings to datetime objects, but can also infer datetimes from a number of different formats. That inference can be rather slow for a large file, however.
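For instance, pandas can work out Twitter's timestamp format on its own, falling back to flexible parsing when the string does not match a standard format:

```python
import pandas as pd

# Format inference is convenient, but slow when repeated
# across hundreds of thousands of rows
pd.to_datetime('Thu Nov 29 19:22:55 +0000 2018')
# Timestamp('2018-11-29 19:22:55+0000', tz='UTC')
```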
Fortunately, the pd.to_datetime function can be provided with a specific format string, so that it does not need to infer a format. Format strings are based upon the Python strftime format codes. By providing an explicit format string to the conversion function for our tweet timestamps, the conversion is quite fast (a few seconds, even for the full data file); if we do not provide a format hint of this sort, the conversion takes more than a minute for each function call. In the code that follows (see the sketch after this list), we:
- read the CSV file into pandas
- convert two of the fields that are timestamps to datetime objects with a specific format string
- write out a new CSV file with the reformatted data
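A sketch of those three steps, assuming the file and column names from the extraction step above (the output filename is our own illustrative choice):

```python
import pandas as pd

# strftime pattern matching Twitter's timestamp format,
# e.g. 'Thu Nov 29 19:22:55 +0000 2018'
TWITTER_FORMAT = '%a %b %d %H:%M:%S %z %Y'

df = pd.read_csv('climatechange_tweets_sample.csv')

# An explicit format avoids slow per-value format inference
for col in ['tweet_created_at', 'user_created_at']:
    df[col] = pd.to_datetime(df[col], format=TWITTER_FORMAT)

# The rewritten file stores timestamps in a standard ISO 8601 form,
# which pandas parses quickly on subsequent reads
df.to_csv('climatechange_tweets_reformatted.csv', index=False)
```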
Working Further With the Reformatted CSV File
Having tweaked the datetime formats, if we want to read our new CSV file into a pandas dataframe, we can augment the pd.read_csv call to specify which columns to parse as dates, using the parse_dates option. If we had not previously altered the datetime formats, this call to pd.read_csv would be very slow due to the need for datetime format inference; now that we have fixed the formats, the date parsing proceeds quickly. The pd.read_csv function, however, does not allow one to specify a format string, which is why we needed to do the conversion with the pd.to_datetime function as above. Here's what the new read_csv function call looks like:
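(A sketch, using the hypothetical file and column names from the previous step.)

```python
import pandas as pd

# parse_dates names the columns read_csv should convert to datetimes;
# this is fast now that the timestamps are in a standard format
df = pd.read_csv('climatechange_tweets_reformatted.csv',
                 parse_dates=['tweet_created_at', 'user_created_at'])
```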