Data and ColumnDataSources

ColumnDataSources

To provide data to plotting routines, Bokeh defines its own data type named ColumnDataSource (CDS). For many applications, a CDS is conceptually equivalent to a data table, or a DataFrame in Pandas, with equal-length data columns associated with column labels. In this sense, a CDS is also roughly equivalent to a Python dictionary that stores lists or NumPy arrays of data, keyed by labels, with the caveat that all the data arrays must be of the same length.
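
As a minimal sketch of the dictionary analogy, the snippet below builds a CDS from a plain dictionary and reads the columns back. The column names `time` and `signal` and their values are made up for illustration and are not part of the workshop examples.

```python
import numpy as np
from bokeh.models import ColumnDataSource

# Hypothetical data: two equal-length columns keyed by label
data = {
    'time': np.linspace(0.0, 1.0, 5),
    'signal': [0.0, 0.7, 1.0, 0.7, 0.0],
}

source = ColumnDataSource(data)

# Column labels are available much like dictionary keys
print(source.column_names)

# Each column is retrieved by its label from the data attribute
print(source.data['signal'])
```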

On the previous page, we made some simple plots by passing either two lists of numbers or two NumPy arrays to various plotting methods. Internally, Bokeh converted these separate lists and arrays into CDSs with two columns, which it then used for plotting. In fact, if you passed in two lists of different lengths, as discussed on the previous page, the warning message would include the phrase ColumnDataSource's columns must be of the same length, even though the example code made no reference to a CDS. For simple datasets, it is often easier to pass in lists or arrays of data without explicitly constructing a CDS, but for more complicated data and applications, working with a CDS directly is preferable. We'll show a few examples of ColumnDataSources in action below.
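
To see the implicit conversion for yourself, one option is to inspect the renderer that a glyph method returns. The sketch below assumes the renderer exposes the automatically created source through its `data_source` attribute; the lists passed in are arbitrary illustrative values.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

p = figure()

# Pass plain lists; no ColumnDataSource is constructed explicitly here
renderer = p.line([1, 2, 3], [4, 5, 6])

# The renderer returned by the glyph method holds the source Bokeh built
implicit_source = renderer.data_source
print(type(implicit_source).__name__)
print(implicit_source.column_names)
```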

# ColumnDataSource
import numpy as np
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

p = figure()

# sample data: 101 evenly spaced points on [0, 10]
x = np.linspace(0., 10., 101)
y = np.sin(x) * np.cos(x)**2

# build the CDS from a dictionary of equal-length columns
source = ColumnDataSource({'x': x, 'y': y})

# refer to the columns of source by name
p.circle(x='x', y='y', source=source, size=10, color='purple', alpha=1.0)
p.line(x='x', y='y', source=source, color='orange')

show(p)

In the code example above, we created a ColumnDataSource instance named source from a Python dictionary containing the x and y data arrays, keyed by 'x' and 'y', respectively. Then, when we called the circle and line methods, we passed source as an argument and selected the columns to plot by giving their names, 'x' and 'y'.

Once we've constructed a CDS, we can update the data stored in the source rather easily, by setting the appropriate fields in the source's data attribute, as shown below. This is especially useful if a dataset is being updated, for example, in a time series animation or in response to user inputs.

# imagine we've got some new y data to plot
new_y = np.sin(x) * np.cos(x)**4

source.data['y'] = new_y

# the plot object p has already been created above

p.circle(x='x', y='y', source=source, size=10, color='green', alpha=0.5)
p.line(x='x', y='y', source=source, color='purple')

show(p)
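
One caveat when updating: if the new data has a different number of points, assigning a single column would leave the columns with unequal lengths. A hedged sketch of one way to handle this is to assign a complete new dictionary to the data attribute, so that all columns change together. The names `new_x` and `new_y` and the coarser grid are illustrative assumptions.

```python
import numpy as np
from bokeh.models import ColumnDataSource

x = np.linspace(0.0, 10.0, 101)
source = ColumnDataSource({'x': x, 'y': np.sin(x) * np.cos(x)**2})

# New data sampled on a coarser grid, so both columns change length
new_x = np.linspace(0.0, 10.0, 51)
new_y = np.sin(new_x) * np.cos(new_x)**4

# Replace every column in one assignment to keep lengths consistent
source.data = {'x': new_x, 'y': new_y}

print(len(source.data['x']), len(source.data['y']))
```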

ColumnDataSources from Pandas DataFrames

While we created a CDS with data in a dictionary above, we can also do so using data in a Pandas DataFrame. DataFrames are very useful data structures for manipulating tabular data. In the code example below, a sample dataset is loaded into a DataFrame, which is subsequently used to populate a CDS. The sample data is the widely used automobile mpg dataset that lists information on a variety of different car models. Once the CDS is created, it is easy to generate plots of different columns against each other. If we plot the weight of each car model vs. the miles per gallon (mpg) for that model, we see an inverse relationship: lighter cars generally get better gas mileage.

from bokeh.sampledata.autompg import autompg_clean as df
df.head()   # return the head of the DataFrame

The head (first 5 lines) of the mpg DataFrame
	mpg	cyl	displ	hp	weight	accel	yr	origin	name	mfr
0	18.0	8	307.0	130	3504	12.0	70	North America	chevrolet chevelle malibu	chevrolet
1	15.0	8	350.0	165	3693	11.5	70	North America	buick skylark 320	buick
2	18.0	8	318.0	150	3436	11.0	70	North America	plymouth satellite	plymouth
3	16.0	8	304.0	150	3433	12.0	70	North America	amc rebel sst	amc
4	17.0	8	302.0	140	3449	10.5	70	North America	ford torino	ford

from bokeh.plotting import figure, show, output_file
from bokeh.models import ColumnDataSource

p = figure()
output_file('mpg.html')  # relevant contents of mpg.html have been inserted below to make the plot

# df is the autompg DataFrame loaded above
source = ColumnDataSource(df)

# names of the DataFrame columns to plot against each other
xcol = 'weight'
ycol = 'mpg'

p.circle(x=xcol, y=ycol, source=source, size=10, color='green', alpha=0.5)
p.xaxis.axis_label = xcol
p.yaxis.axis_label = ycol

show(p)
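
To check what a CDS built from a DataFrame contains, the sketch below uses a small hand-built DataFrame so that it runs without the sample data installed. The name `df_small` is hypothetical; its values are the mpg and weight entries from the five rows in the table above.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Two columns copied from the first five rows shown in the table above
df_small = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0, 17.0],
    'weight': [3504, 3693, 3436, 3433, 3449],
})

source = ColumnDataSource(df_small)

# The DataFrame columns become CDS columns; the unnamed index is added as 'index'
print(source.column_names)

# The data can be converted back into a DataFrame
print(source.to_df().head())
```

Because the index is carried along as its own column, it can also be referenced by name in plotting calls, in the same way as 'weight' or 'mpg'.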
