Twitter Data
You are probably already familiar with the Twitter social network, a messaging system that lets people send and receive short messages, optionally with attached images and a limited set of metadata. It has been used effectively as a near real-time communications tool during numerous global social events. Twitter also provides an API that allows anyone to collect large amounts of data and perform a wide range of analyses to better understand these networks. This openness can also be a weakness: it leaves the platform vulnerable to ‘bots’ and ‘state-backed’ accounts that are used to spread disinformation.
In these lessons, you will learn how to collect tweets from the Twitter Streaming API with the Python tweepy library. We will then perform some basic analysis and visualization of the tweet data, and later use networkx and some other packages for graph analysis to gain insight into the top influencers in the network.
About the Twitter Dataset
The Twitter dataset we will be using was collected with the Twitter Streaming API by tracking the hashtag '#climatechange'. It is a particularly interesting dataset because it covers November 20 to December 5, 2018, a period during which the U.S. government released its latest findings on climate change. We will learn more about the dataset in a subsequent section.
We will collect data using the Twitter Streaming API, as discussed later in Accessing Data via API: Twitter. For now, to get an idea of how much data is in a single tweet, we present an example of a tweet in JSON format (you'll need to scroll right to see all of it):
All tweets have this general structure, but the details vary depending on things such as whether the tweet is an original tweet or a retweet. Retweets include a "retweeted_status" field, which contains all of the information about the original tweet that was retweeted. Another difference is whether the tweet is geolocated, in which case it carries the latitude and longitude of the user's location, and so on.
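As a minimal sketch of how these differences show up in code, the function below inspects a tweet that has already been parsed from JSON into a Python dictionary. The helper name describe_tweet and the two trimmed-down sample tweets (IDs, text, and coordinates) are hypothetical and only for illustration; the sketch assumes the standard tweet layout, where geolocation is stored under "coordinates" as a GeoJSON point in longitude, latitude order.

```python
def describe_tweet(tweet):
    """Summarize a tweet dict parsed from the API's JSON (hypothetical helper)."""
    # Retweets embed the original tweet under "retweeted_status".
    original = tweet.get("retweeted_status")
    is_retweet = original is not None

    # Geolocated tweets carry a GeoJSON point: [longitude, latitude].
    coords = tweet.get("coordinates")
    lon_lat = tuple(coords["coordinates"]) if coords else None

    return {
        "id": tweet["id"],
        "is_retweet": is_retweet,
        "original_id": original["id"] if is_retweet else None,
        "lon_lat": lon_lat,
    }


# Hypothetical, heavily trimmed tweets (made-up IDs, text, and location).
original_tweet = {"id": 1, "text": "Example text", "coordinates": None}
retweet = {
    "id": 2,
    "text": "RT @someone: Example text",
    "coordinates": {"type": "Point", "coordinates": [-105.27, 40.01]},
    "retweeted_status": original_tweet,
}

for t in (original_tweet, retweet):
    print(describe_tweet(t))
```

Using dict.get rather than direct indexing keeps the function from failing on tweets that lack the optional fields.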
Twitter Search API
We will not be using the Twitter Search API in these lessons, but it is useful to be aware of its purpose. The main difference between the two is that the Search API is typically used to collect past tweets, whereas the Streaming API delivers current tweets in near real time.
Twitter limits the standard Search API to at most 100 tweets per request, and the number of requests is itself rate limited within each 15-minute window. If you repeat a query to collect as many tweets as possible and ask for more tweets than were posted since your previous query, the results will include duplicates of tweets you have already collected. These duplicates must be filtered out prior to analysis.
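One simple way to filter such duplicates is to keep only the first occurrence of each tweet ID. The sketch below assumes each batch of search results has already been parsed into a list of dictionaries with an "id" field; the function name deduplicate and the sample batches are hypothetical.

```python
def deduplicate(batches):
    """Merge batches of tweets, keeping the first occurrence of each tweet ID."""
    seen = set()
    unique = []
    for batch in batches:
        for tweet in batch:
            if tweet["id"] not in seen:
                seen.add(tweet["id"])
                unique.append(tweet)
    return unique


# Hypothetical results from two consecutive queries; tweet 102 appears in both.
first_window = [{"id": 101, "text": "a"}, {"id": 102, "text": "b"}]
second_window = [{"id": 102, "text": "b"}, {"id": 103, "text": "c"}]

unique_tweets = deduplicate([first_window, second_window])
print([t["id"] for t in unique_tweets])  # [101, 102, 103]
```

Tracking IDs in a set keeps the check fast even for large collections, and appending to a list preserves the order in which tweets were first seen.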
The Search API is often used to fill in gaps in recent tweet activity. However, Twitter limits how far back in time a hashtag search query can reach, to roughly 1-2 weeks (about 7 days for the standard Search API).
You can also retrieve a Twitter user's timeline and collect up to that user's 3,200 most recent tweets, regardless of how far back in time they were posted. More information about the Twitter Search API is available online.