
Twitter Data

You are probably already familiar with Twitter, a social network built on a messaging system that lets people send and receive short messages, optionally with attached images, along with a limited set of metadata. It has been used effectively as a near real-time communications tool during numerous global events. Twitter also provides an API that allows anyone to collect large amounts of data and analyze it to better understand these networks. This openness can also be a weakness: it leaves the platform vulnerable to ‘bots’ and ‘state-backed’ accounts that are used to spread disinformation.

In these lessons, you will learn how to collect tweets from the Twitter Streaming API with the Python tweepy library. We will then perform some basic analysis and visualization of the tweet data, and later use networkx and some other packages for graph analysis to gain insight into the top influencers in the network.

About the Twitter Dataset

The Twitter dataset we will be using was created using the Twitter Streaming API and the hashtag '#climatechange'. This dataset is particularly interesting because it was collected from November 20 to December 5, 2018, a period during which the U.S. government released its latest findings on climate change. We will learn more about the dataset in a subsequent section.

We will collect data using the Twitter Streaming API, as discussed later in Accessing Data via API: Twitter. For now, to give an idea of how much data a single tweet contains, here is an example of a tweet in JSON format (you'll need to scroll right to see all of it):

{"created_at":"Tue Nov 27 00:19:11 +0000 2018","id":1067211197412388864,"id_str":"1067211197412388864","text":"RT @DocsEnvAus: @susanprescott88 paediatrician representing the @TheRACP - #ClimateChange already affecting the physical and mental health\u2026","source":"\u003ca href=\"https:\/\/mobile.twitter.com\" rel=\"nofollow\"\u003eTwitter Lite\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2479088160,"id_str":"2479088160","name":"Marie Coleman","screen_name":"MarieCo92176893","location":"Canberra","url":"http:\/\/www.nfaw.org","description":"feminist, social policy analyst. All comments personal views","translator_type":"none","protected":false,"verified":false,"followers_count":1874,"friends_count":260,"listed_count":83,"favourites_count":103790,"statuses_count":101995,"created_at":"Tue May 06 01:55:58 +0000 2014","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"40203A","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/794101726374600704\/vykECRS2_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/794101726374600704\/vykECRS2_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2479088160\/1478164006","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":
null,"retweeted_status":{"created_at":"Mon Nov 26 23:03:25 +0000 2018","id":1067192127124164612,"id_str":"1067192127124164612","text":"@susanprescott88 paediatrician representing the @TheRACP - #ClimateChange already affecting the physical and mental\u2026 https:\/\/t.co\/AzaqkwAoj3","display_text_range":[0,140],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":1213283748,"in_reply_to_user_id_str":"1213283748","in_reply_to_screen_name":"susanprescott88","user":{"id":1542665564,"id_str":"1542665564","name":"DrsForTheEnvironment","screen_name":"DocsEnvAus","location":"Australia","url":"http:\/\/www.dea.org.au","description":"Non-profit organisation dedicated to improving the environment for human health","translator_type":"none","protected":false,"verified":false,"followers_count":3422,"friends_count":1734,"listed_count":145,"favourites_count":2705,"statuses_count":8596,"created_at":"Mon Jun 24 06:45:28 +0000 
2013","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/838641283375616000\/wLo35xnp_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/838641283375616000\/wLo35xnp_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1542665564\/1503892819","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"@susanprescott88 paediatrician representing the @TheRACP - #ClimateChange already affecting the physical and mental health of children. 
The #health of our children should be our priority - we urgently need #climateaction #NoTimeForGames @ama_media @RNBreakfast @DrGCrisp https:\/\/t.co\/8RQwPCHt8m","display_text_range":[0,270],"entities":{"hashtags":[{"text":"ClimateChange","indices":[59,73]},{"text":"health","indices":[140,147]},{"text":"climateaction","indices":[206,220]},{"text":"NoTimeForGames","indices":[221,236]}],"urls":[],"user_mentions":[{"screen_name":"susanprescott88","name":"Susan Prescott MDPhD","id":1213283748,"id_str":"1213283748","indices":[0,16]},{"screen_name":"TheRACP","name":"The RACP","id":1117895137,"id_str":"1117895137","indices":[48,56]},{"screen_name":"ama_media","name":"AMA Media","id":59024550,"id_str":"59024550","indices":[237,247]},{"screen_name":"RNBreakfast","name":"RN Breakfast","id":20138772,"id_str":"20138772","indices":[248,260]},{"screen_name":"DrGCrisp","name":"George Crisp","id":157568648,"id_str":"157568648","indices":[261,270]}],"symbols":[],"media":[{"id":1067192113324904449,"id_str":"1067192113324904449","indices":[271,294],"media_url":"http:\/\/pbs.twimg.com\/media\/Ds9tkqXUUAEk6Rk.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/Ds9tkqXUUAEk6Rk.jpg","url":"https:\/\/t.co\/8RQwPCHt8m","display_url":"pic.twitter.com\/8RQwPCHt8m","expanded_url":"https:\/\/twitter.com\/DocsEnvAus\/status\/1067192127124164612\/photo\/1","type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"small":{"w":680,"h":383,"resize":"fit"},"medium":{"w":1200,"h":675,"resize":"fit"},"large":{"w":2048,"h":1152,"resize":"fit"}}}]},"extended_entities":{"media":[{"id":1067192113324904449,"id_str":"1067192113324904449","indices":[271,294],"media_url":"http:\/\/pbs.twimg.com\/media\/Ds9tkqXUUAEk6Rk.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/Ds9tkqXUUAEk6Rk.jpg","url":"https:\/\/t.co\/8RQwPCHt8m","display_url":"pic.twitter.com\/8RQwPCHt8m","expanded_url":"https:\/\/twitter.com\/DocsEnvAus\/status\/1067192127124164612\/photo\/1","type":"photo","sizes":{"thumb":{"w":
150,"h":150,"resize":"crop"},"small":{"w":680,"h":383,"resize":"fit"},"medium":{"w":1200,"h":675,"resize":"fit"},"large":{"w":2048,"h":1152,"resize":"fit"}}}]}},"quote_count":0,"reply_count":1,"retweet_count":5,"favorite_count":4,"entities":{"hashtags":[{"text":"ClimateChange","indices":[59,73]}],"urls":[{"url":"https:\/\/t.co\/AzaqkwAoj3","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/1067192127124164612","display_url":"twitter.com\/i\/web\/status\/1\u2026","indices":[117,140]}],"user_mentions":[{"screen_name":"susanprescott88","name":"Susan Prescott MDPhD","id":1213283748,"id_str":"1213283748","indices":[0,16]},{"screen_name":"TheRACP","name":"The RACP","id":1117895137,"id_str":"1117895137","indices":[48,56]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"ClimateChange","indices":[75,89]}],"urls":[],"user_mentions":[{"screen_name":"DocsEnvAus","name":"DrsForTheEnvironment","id":1542665564,"id_str":"1542665564","indices":[3,14]},{"screen_name":"susanprescott88","name":"Susan Prescott MDPhD","id":1213283748,"id_str":"1213283748","indices":[16,32]},{"screen_name":"TheRACP","name":"The RACP","id":1117895137,"id_str":"1117895137","indices":[64,72]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1543277951945"}

All tweets share this general structure, but the details vary depending on, for example, whether the tweet is an original tweet or a retweet. A retweet includes a "retweeted_status" field that contains all of the information about the original tweet that was retweeted. Tweets may also differ in whether they are geolocated with the latitude and longitude of the user's location, and so on.
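As a rough illustration of these differences, the sketch below inspects a heavily trimmed copy of the example tweet using only the Python standard library. The helper names (is_retweet, original_tweet, full_text, created_at, lon_lat) are hypothetical and not part of tweepy or the Twitter API. The lon_lat helper assumes that a geolocated tweet carries a GeoJSON-style point in its "coordinates" field, ordered longitude then latitude; treat that as an assumption to check against a geolocated record, since the example tweet has none.

```python
import json
from datetime import datetime

# Heavily trimmed stand-in for the example retweet above. Only the fields
# used below are kept, and the full_text value is shortened.
raw = json.dumps({
    "created_at": "Tue Nov 27 00:19:11 +0000 2018",
    "id_str": "1067211197412388864",
    "text": "RT @DocsEnvAus: @susanprescott88 paediatrician representing the @TheRACP",
    "user": {"screen_name": "MarieCo92176893"},
    "coordinates": None,
    "retweeted_status": {
        "created_at": "Mon Nov 26 23:03:25 +0000 2018",
        "id_str": "1067192127124164612",
        "text": "@susanprescott88 paediatrician representing the @TheRACP",
        "truncated": True,
        "user": {"screen_name": "DocsEnvAus"},
        "coordinates": None,
        "extended_tweet": {
            "full_text": "@susanprescott88 paediatrician representing the "
                         "@TheRACP - #ClimateChange already affecting the "
                         "physical and mental health of children."
        },
    },
})

tweet = json.loads(raw)


def is_retweet(tweet):
    """A retweet carries the original tweet under 'retweeted_status'."""
    return "retweeted_status" in tweet


def original_tweet(tweet):
    """Return the embedded original for a retweet, else the tweet itself."""
    return tweet.get("retweeted_status", tweet)


def full_text(tweet):
    """Prefer the untruncated text when an 'extended_tweet' is present."""
    extended = tweet.get("extended_tweet")
    if extended:
        return extended["full_text"]
    return tweet["text"]


def created_at(tweet):
    """Parse the 'created_at' string into a timezone-aware datetime."""
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")


def lon_lat(tweet):
    """Return (longitude, latitude) if the tweet is geolocated, else None.

    Assumes a GeoJSON-style point: {"coordinates": [lon, lat], ...}.
    """
    point = tweet.get("coordinates")
    if not point:
        return None
    return tuple(point["coordinates"])


source = original_tweet(tweet)
print("retweet:", is_retweet(tweet))
print("retweeted by:", tweet["user"]["screen_name"])
print("original author:", source["user"]["screen_name"])
print("original text:", full_text(source))
print("retweeted at:", created_at(tweet).isoformat())
print("location:", lon_lat(tweet))
```

Reading the text through original_tweet and full_text means an analysis works with the complete original message rather than the shortened "RT @..." text stored at the top level of a retweet.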

Twitter Search API

We will not be using the Twitter Search API in these lessons, but it is useful to be aware of its purpose. The main difference between the two is that the Search API is typically used to collect past tweets, whereas the Streaming API delivers current tweets in near real time.

Twitter limits the Search API to roughly 100 tweets per 15-minute window. If you set a parameter to maximize the tweet count for a 15-minute window and that value exceeds the number of tweets actually posted during the window, the results will include duplicates of tweets from a previous window, which must be filtered out prior to analysis.
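One simple way to filter such duplicates is to key each tweet on its "id_str" field and keep only the first occurrence. The following is a minimal sketch, assuming the tweets have already been loaded as Python dictionaries; drop_duplicate_tweets is a hypothetical helper, and the third ID in the sample batches is made up for illustration.

```python
def drop_duplicate_tweets(tweets):
    """Keep the first occurrence of each tweet, identified by 'id_str'."""
    seen = set()
    unique = []
    for tweet in tweets:
        tweet_id = tweet["id_str"]
        if tweet_id in seen:
            continue  # already collected in an earlier window
        seen.add(tweet_id)
        unique.append(tweet)
    return unique


# Two hypothetical batches from consecutive windows that overlap by one tweet.
batch_1 = [{"id_str": "1067192127124164612"}, {"id_str": "1067211197412388864"}]
batch_2 = [{"id_str": "1067211197412388864"}, {"id_str": "1067211197412388999"}]

unique_tweets = drop_duplicate_tweets(batch_1 + batch_2)
print(len(unique_tweets), "unique tweets out of", len(batch_1 + batch_2))
```

Using the string ID rather than the numeric "id" avoids precision problems if the data is later handled by tools that store large integers as floating-point numbers.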

The Search API is often used to collect tweets from the past week or two and can be useful for filling in gaps in recent tweet activity. However, Twitter limits how far back in time a hashtag search query can reach to roughly 1-2 weeks.

You can also search a Twitter user's timeline and collect up to the most recent 3,200 tweets, regardless of how far back in time they were posted. More information about the Twitter Search API is available online.
