Introducing Datasets
This roadmap considers three datasets: one of baseball statistics, one of tweets and retweets on Twitter, and one of historical data about wildfires in California. Each dataset is revisited several times throughout the roadmap, either to introduce new types of functionality or to highlight particular tools. Readers interested in only one dataset may therefore find its material spread across several sections.
Accompanying the narrative in this CVW roadmap is a set of Jupyter notebooks that work through analyses carried out on each of the datasets, and present some additional analyses and exercises that readers might like to work through as well. We encourage you to download these notebooks and run them locally, while following along with the CVW tutorial, if you'd like to get some hands-on experience or would like to dig down into these datasets further. Further general information on working with Jupyter notebooks can be found on the Jupyter home page.
The Jupyter notebooks are available in our accompanying GitHub repository, stored in the subdirectory named code. (Alongside the code directory is another directory named data, containing relevant data for these materials.) Assuming you have git set up where you'd like to run this code, you can clone the repository with the commands:
The last two commands in the set above enable you to clone the baseball data, which is contained in a separate GitHub repository and is included as a submodule in our tutorial repository. This will download the baseball data into a subdirectory named data/baseballdatabank. The wildfire dataset is also included in the GitHub repository, stored in the subdirectory named data/wildfires. The Twitter data file is too large for the repository and is instead available through a separate download, with a link below. After downloading that file, you can move it into the repository subdirectory named data/twitter, which is where the code in the notebooks expects it to reside.
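Before opening the notebooks, it can be useful to confirm that the three data subdirectories described above are in place. The following is a minimal sketch, not part of the tutorial repository: the function name check_data_layout and the constant EXPECTED_DATA_DIRS are hypothetical, and the script assumes it is run from the root of the cloned repository. It treats an empty directory as missing, since a submodule that has not been initialized typically leaves an empty directory behind.

```python
from pathlib import Path

# Data subdirectories named in the text, relative to the repository root.
# (The constant and function names here are illustrative, not from the tutorial.)
EXPECTED_DATA_DIRS = (
    "data/baseballdatabank",  # baseball data, pulled in as a git submodule
    "data/wildfires",         # wildfire data, included in the repository
    "data/twitter",           # Twitter data file, downloaded separately
)


def check_data_layout(repo_root):
    """Return a dict mapping each expected subdirectory to True if it
    exists and contains at least one entry, and False otherwise."""
    root = Path(repo_root)
    status = {}
    for rel in EXPECTED_DATA_DIRS:
        path = root / rel
        # An uninitialized submodule usually shows up as an empty directory,
        # so require at least one entry inside it.
        status[rel] = path.is_dir() and any(path.iterdir())
    return status


if __name__ == "__main__":
    # Assumes the current working directory is the repository root.
    for rel, ok in check_data_layout(".").items():
        print(f"{rel}: {'found' if ok else 'missing or empty'}")
```

If data/baseballdatabank is reported as missing or empty, the submodule commands were probably not run. If data/twitter is reported, the separately downloaded file has probably not yet been moved into place.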
Alternatively, if you just want to view the rendered notebooks on GitHub, or download them individually without using git, you can access them in your browser through the links below.
Baseball Notebook
Twitter Notebooks & Data
- Tweet data extraction notebook
- Tweet timeline visualization notebook
- Tweet interactive visualization notebook
- Tweet classification notebook
- Retweet network notebook
- Twitter CSV data file (approximately 119 MB)