This page contains another example of the use of hovering with Bokeh to reveal additional information as part of a data visualization. The visualization below is reproduced from our companion material on Python for Data Science, illustrating the use of dimensionality reduction methods supported by the scikit-learn package for machine learning. In particular, the TSNE method (t-distributed stochastic neighbor embedding) is used to reduce high-dimensional data involving baseball statistics for visualization in the interactive plot below. If you are interested in the details of that process, please consult our companion material.
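As a rough illustration of the dimensionality reduction step mentioned above, the following minimal sketch runs scikit-learn's `TSNE` on synthetic data. The array `rates` is a made-up stand-in for the per-player hitting rates, and the parameter values (`perplexity=30`, `init="pca"`, `random_state=0`) are assumptions chosen for the example, not the settings used to build the actual map.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the real data: 200 hypothetical "players",
# each described by 8 outcome rates that sum to 1.
rng = np.random.default_rng(0)
rates = rng.random((200, 8))
rates = rates / rates.sum(axis=1, keepdims=True)

# Reduce the 8-dimensional rate vectors to 2 dimensions for plotting.
# The parameter values here are illustrative assumptions.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(rates)

print(embedding.shape)  # one (x, y) pair per player
```

Each row of `embedding` would then supply the x and y coordinates of one dot in a scatter plot.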

The plot below represents an interactive visual map of the hitting statistics for every player who has played in Major League Baseball (MLB). Each point represents a single player, whose career hitting (or batting) statistics are summarized as rates per at-bat for different hitting outcomes, averaged over their career: specifically, the rate at which that player scored a run (R), got a hit (H), got a two-base hit (2B), got a three-base hit (3B), hit a home run (HR), suffered a strikeout (SO), received a base on balls (BB), or achieved some other outcome not summarized in the plot below. Blue dots represent pitchers (specifically, those people listed in the pitching dataset who appeared as a pitcher in at least 10 games over their career), while red dots represent position players (referred to simply as "players" here). Large dots represent people who have been inducted into the Baseball Hall of Fame (HOF), while small dots represent people who have not.
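The rate-per-at-bat summary described above could be computed along the following lines with pandas. This is only a sketch: the table `career`, its column names (`AB` for at-bats plus one column per outcome), and the three players in it are hypothetical, and the treatment of players with zero at-bats (dropping them) is an assumption rather than a documented step.

```python
import pandas as pd

# Hypothetical career totals; the names and numbers are invented for illustration.
career = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "AB": [5000, 300, 0],
    "R":  [800, 20, 0],
    "H":  [1500, 45, 0],
    "2B": [300, 6, 0],
    "3B": [50, 1, 0],
    "HR": [200, 2, 0],
    "SO": [700, 120, 0],
    "BB": [600, 15, 0],
})

outcomes = ["R", "H", "2B", "3B", "HR", "SO", "BB"]

# Assumption: players with no at-bats are dropped, since a rate is undefined for them.
career = career[career["AB"] > 0].copy()

# Divide each outcome total by the player's career at-bats.
rates = career[outcomes].div(career["AB"], axis=0)

print(rates.round(3))
```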

Because this is a representation of hitting statistics, it is perhaps not surprising that pitchers and position players tend to be separated from one another (since pitchers are not primarily valued for their hitting skills, and have historically been poor hitters). This map is interactive, in that you can hover over a particular player and see information about their name and some career batting statistics. You can also pan and zoom around the map by selecting various tools in the right-hand panel if you want to examine subgroups in more detail. If you zoom in and want to reset to the full map, select the reset tool in the panel.
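The hover, pan, zoom, and reset behavior described above can be sketched in Bokeh roughly as follows. The coordinates, names, and rates are invented placeholders, and the column names (`hr_rate`, `dot_color`, `dot_size`) are hypothetical; the sketch only aims to show how per-point color and size, a toolbar, and hover tooltips fit together, not how the actual map was built.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show

# Placeholder data: four hypothetical points with made-up coordinates.
# dot_color encodes pitcher (blue) vs. position player (red);
# dot_size encodes Hall of Fame membership (large) vs. not (small).
source = ColumnDataSource(data=dict(
    x=[-3.1, 0.4, 2.2, 5.0],
    y=[1.0, -2.5, 0.3, 4.1],
    name=["Player A", "Player B", "Player C", "Player D"],
    hr_rate=[0.002, 0.015, 0.041, 0.066],
    dot_color=["blue", "red", "red", "red"],
    dot_size=[4, 4, 4, 10],
))

# Pan, zoom, and reset tools appear in a toolbar on the right-hand side.
p = figure(title="Hitting map (illustrative data)",
           tools="pan,wheel_zoom,box_zoom,reset",
           toolbar_location="right")

p.scatter("x", "y", source=source,
          color="dot_color", size="dot_size", alpha=0.6)

# Tooltips pull values from the data source columns via the @column syntax.
hover = HoverTool(tooltips=[("Name", "@name"),
                            ("HR per AB", "@hr_rate{0.000}")])
p.add_tools(hover)

# show(p)  # uncomment to open the plot in a browser
```

Storing color and size as columns in the data source keeps the styling tied to each player's record, so the same source can feed both the glyphs and the tooltips.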

Our arbitrary and not particularly systematic choice to identify anyone as a "Pitcher" if they appeared in at least 10 games as a pitcher over the course of their career has interesting implications. For example, Jimmie Foxx, one of the greatest sluggers in the history of baseball, more or less came out of retirement in 1945 when many other players were fighting in World War II, and played as both a position player and a pitcher, just reaching the arbitrary 10-game threshold. He shows up in the hitting map alongside some other Hall of Fame sluggers, including Babe Ruth, who actually did pitch a lot, and very effectively, in the early part of his career before being converted to a full-time position player in order to make best use of his prodigious hitting skills.
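The 10-game threshold discussed above amounts to a simple aggregate-and-filter step. A minimal pandas sketch follows, assuming a season-level pitching table with hypothetical columns `playerID`, `yearID`, and `G` (games pitched); the identifiers and counts are invented, and the real dataset's layout may differ.

```python
import pandas as pd

# Hypothetical season-by-season pitching appearances.
pitching = pd.DataFrame({
    "playerID": ["p1", "p1", "p2", "p3", "p3"],
    "yearID":   [1939, 1945, 1950, 1914, 1915],
    "G":        [1, 9, 3, 35, 32],
})

PITCHER_THRESHOLD = 10  # minimum career games pitched to be labeled a pitcher

# Sum games pitched over each player's career, then apply the threshold.
career_games = pitching.groupby("playerID")["G"].sum()
pitcher_ids = set(career_games[career_games >= PITCHER_THRESHOLD].index)

print(sorted(pitcher_ids))
```

In this toy table, `p1` just reaches the threshold with exactly 10 career games, which mirrors how a player who pitched only briefly can end up labeled a pitcher.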

While the details of TSNE plots such as this are difficult to interpret, and change stochastically from run to run, the fact that many HOF position players are bunched up along the edges of the map presumably indicates that they are "outliers" with regard to their hitting skills, perhaps providing insight into their inclusion in the Hall of Fame. There are different subsets of HOF players, which perhaps reflect either different hitting characteristics (e.g., sluggers vs. not) or different eras. Some of the HOF position players situated more toward the interior of the diagram seem to reflect lighter-hitting players more valued for their fielding skills. If you poke around the southernmost part of the map, you can find that the statistics of Tris Speaker (in the HOF) and Shoeless Joe Jackson (not) are extremely similar. Shoeless Joe's exclusion from the Hall is something that ought to be rectified.
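The run-to-run variability of TSNE layouts can be seen in a small experiment. The sketch below embeds the same synthetic data twice with different seeds; the data, the seed values, and the choice of `method="exact"` with a random initialization (used here only to keep a small example simple) are all assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Small synthetic data set: 60 hypothetical points in 8 dimensions.
rng = np.random.default_rng(42)
data = rng.random((60, 8))

def embed(seed):
    """Run TSNE from a random initialization controlled by `seed`."""
    tsne = TSNE(n_components=2, perplexity=10, init="random",
                method="exact", random_state=seed)
    return tsne.fit_transform(data)

run_a = embed(0)
run_b = embed(1)

# Different seeds generally place the same points at different coordinates.
print(np.allclose(run_a, run_b))
```

Fixing `random_state` is the usual way to make a layout repeatable, although the positions themselves (such as which group ends up in the "southernmost" part of the map) carry no intrinsic meaning.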

© Cornell University | Center for Advanced Computing