Data and ColumnDataSources
ColumnDataSources
To provide data to plotting routines, Bokeh introduces a new datatype named ColumnDataSource (CDS). For many applications, a CDS is conceptually equivalent to a data table, or a DataFrame in Pandas, with equal-length data columns associated with column labels. In this sense, a CDS is also largely equivalent to a Python dictionary that stores lists or NumPy arrays of data, keyed by labels, with the caveat that all the data arrays must have the same length.
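As a minimal sketch of the dictionary analogy, the snippet below builds a CDS from a dictionary holding one list and one NumPy array of equal length, then reads the columns back by label. The column labels and values are made up for illustration, and the conversion with to_df assumes Pandas is installed.

```python
import numpy as np
from bokeh.models import ColumnDataSource

# A plain dictionary of equal-length columns, keyed by label.
# The labels and values here are illustrative only.
data = {
    "x": [1, 2, 3],                  # a Python list
    "y": np.array([4.0, 5.0, 6.0]),  # a NumPy array of the same length
}

source = ColumnDataSource(data=data)

# The columns remain accessible by label, much like dictionary entries.
print(source.column_names)
print(source.data["y"])

# Converting to a DataFrame makes the table analogy explicit (requires Pandas).
df = source.to_df()
print(df.shape)
```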
On the previous page, we made some simple plots by passing either two lists of numbers or two NumPy arrays to various plotting methods. Internally, Bokeh converted these separate lists and arrays into CDSs with two columns, which it then used for plotting. In fact, if you passed in two lists of different lengths, as discussed on the previous page, the warning message you would see would include the phrase "ColumnDataSource's columns must be of the same length", even though the code in the example made no reference to a CDS. For simple datasets, it is often easier to pass in lists or arrays of data without explicitly constructing a CDS, but for more complicated data and applications, working with a CDS directly is preferable. We'll show a few examples of ColumnDataSources in action below.
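The original code cell for this example is not reproduced here, so the following is a hedged reconstruction rather than the tutorial's exact code. It assumes a sine curve as the sample data, and the radius passed to circle is an arbitrary value in data units chosen only to make the markers visible.

```python
import numpy as np
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# Illustrative data: one period of a sine curve.
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x)

# Build the CDS from a dictionary whose keys become the column names.
source = ColumnDataSource(data={"x": x, "y": y})

p = figure(title="sin(x) from a ColumnDataSource")

# Columns are referred to by name; the source supplies the actual data.
p.circle("x", "y", radius=0.05, source=source)
p.line("x", "y", source=source)

# show(p)  # uncomment to render the plot in a browser or notebook
```

Because both glyphs share one source, any later change to the source's columns affects the circles and the line together.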
In the code example above, we created an instance of a ColumnDataSource object named source from a Python dictionary containing the x and y data arrays, which we labeled with the keys 'x' and 'y', respectively. Then, when we called the circle and line methods for plotting, we passed the source as an argument and specified the columns of source that we wanted to plot by giving their names, 'x' and 'y'.
Once we've constructed a CDS, we can easily update the data stored in the source by setting the appropriate fields in the source's data attribute, as shown below. This is especially useful if a dataset is being updated, for example, in a time series animation or in response to user inputs.
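A minimal sketch of such an update, assuming the same kind of sine-curve source as before (the data values are illustrative). It shows two options: overwriting a single column, and replacing all columns at once by assigning a new dictionary, which is the safer choice when the column length changes.

```python
import numpy as np
from bokeh.models import ColumnDataSource

x = np.linspace(0, 2 * np.pi, 50)
source = ColumnDataSource(data={"x": x, "y": np.sin(x)})

# Overwrite one column; its length must still match the other columns.
source.data["y"] = np.cos(x)

# Replace every column at once by assigning a new dictionary.
# This keeps all columns consistent when the number of points changes.
x_new = np.linspace(0, 4 * np.pi, 100)
source.data = {"x": x_new, "y": np.sin(x_new)}
```

Any glyphs that were drawn from source pick up the new values the next time the plot is rendered or synchronized.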
ColumnDataSources from Pandas DataFrames
While we created a CDS with data in a dictionary above, we can also do so using data in a Pandas DataFrame. DataFrames are very useful data structures for manipulating tabular data. In the code example below, a sample dataset is loaded into a DataFrame, which is subsequently used to populate a CDS. The sample data is the widely used automobile mpg dataset that lists information on a variety of different car models. Once the CDS is created, it is easy to generate plots of different columns against each other. If we plot the weight of each car model vs. the miles per gallon (mpg) for that model, we see an inverse relationship: lighter cars generally get better gas mileage.
| | mpg | cyl | displ | hp | weight | accel | yr | origin | name | mfr |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | North America | chevrolet chevelle malibu | chevrolet |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | North America | buick skylark 320 | buick |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | North America | plymouth satellite | plymouth |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | North America | amc rebel sst | amc |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | North America | ford torino | ford |
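The sketch below illustrates the DataFrame route without depending on how the full dataset is loaded, which is not shown here. As an assumption for illustration, it rebuilds only the five rows from the table above (a subset of the columns), so the resulting plot has five points rather than the full dataset. It uses scatter with a marker size in screen units, since a radius in data units would be awkward on a weight axis.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# Only the five rows shown in the table above; the real dataset is larger.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0],
    "hp": [130, 165, 150, 150, 140],
    "weight": [3504, 3693, 3436, 3433, 3449],
    "mfr": ["chevrolet", "buick", "plymouth", "amc", "ford"],
})

# Passing a DataFrame creates one CDS column per DataFrame column.
# The unnamed DataFrame index is also stored, as a column named "index".
source = ColumnDataSource(df)

p = figure(title="Weight vs. mpg", x_axis_label="weight", y_axis_label="mpg")
p.scatter("weight", "mpg", size=8, source=source)

# show(p)  # uncomment to render the plot in a browser or notebook
```

Plotting a different pair of columns only requires changing the two column names passed to scatter, for example "hp" and "mpg".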