Cornell Virtual Workshop > Python for High Performance > Writing Faster Python

Python Container Types

Python provides a number of container datatypes, both built-in types and those in the collections module in the Python Standard Library. Different data containers serve different purposes, provide different functionality, and present potentially very different computational performance for similar sorts of calculations. Thus, choosing the right container for the task at hand is an important step in achieving good performance. (And when we speak of "containers" here, we are referring to data structures that can hold collections of other data types, not the sorts of software tools like Docker and others that support the reproducible construction of containerized computing environments.)

The core built-in containers in Python are lists, tuples, dictionaries, and sets:

list: a mutable sequence type, holding a collection of objects in a defined order (indexed by integers)
tuple: an immutable sequence type, holding a collection of objects in a defined order (indexed by integers)
dict: a mapping type, associating keys to values (unordered, indexed by keys)
set: an unordered collection of unique elements (accessed through set operations)

All of these container types are iterables, that is, objects that can be iterated over such that each element in the container is visited once. Various operations can often be implemented using these different container types, but with potentially different performance for large datasets, as well as more or less compact programming interfaces. A nice resource summarizing the time complexity of different operations in Python containers can be found in the Python wiki. As a simple example, imagine we want to find the intersection of two sets, that is, all elements that are in both sets. The built-in set datatype works very nicely for this:

set1 = set(range(0,10000))        # create a set with a bunch of numbers in it
set2 = set(range(5000,20000))     # create another set with a bunch of numbers in it
isec = set1 & set2                # same as isec = set1.intersection(set2)

Now, imagine that we don't know about the set datatype and decide to do this with lists instead:

list1 = list(range(0,10000))      # create a list with a bunch of numbers in it
list2 = list(range(5000,20000))   # create another list with a bunch of numbers in it
isec = [e1 for e1 in list1 for e2 in list2 if e1==e2]    # uses list comprehensions

The set-based intersection code is certainly more compact than the list-based code, but the real difference is in their run times (recall that %timeit is an IPython magic function, not a Python expression):

In [10]: %timeit set1 & set2
150 μs ± 2.02 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit [e1 for e1 in list1 for e2 in list2 if e1==e2]
3.05 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The set-based comparison runs more than 20000 times faster than the list-based comparison (150 μs as compared to 3.05s), and the discrepancy will only grow as the problem size increases. The time complexity of the list-based comparison is O(NM), where N and M are the lengths of the two lists being compared, since all N*M pairs of elements must be examined. In contrast, from the TimeComplexity page, we can see that the set-based intersection is O(min(N,M)).

Try this yourself: fire up an ipython interpreter, create lists and sets of various sizes, and do some timing tests. How does the run time of each of the set intersection operations scale with the size of the data?

Membership testing

In addition to iteration, all containers support membership testing: the in keyword specifies that either True or False be returned depending on whether or not object a is in container b (via the expression a in b). But membership testing with sets and dictionaries is much faster, by using hashing (but is therefore restricted such that the keys in dictionaries and elements in sets must be hashable objects). For dictionaries and sets, membership testing is O(1) in time, independent of the size of the container, while membership testing with lists and tuples requires time of O(n) for a container with n elements. (With a list, you might get lucky if the element you are looking for is near the front of the list, since it will be found reasonably quickly; but if you are unlucky, the element of interest will be near the end. On average, the run time for testing will scale as n/2.) So if you are creating a container for the purpose of membership testing, use a set rather than a list. You might be traversing a graph, for example, and want to keep track of which nodes you have already visited so that you do not visit them again — store these in a set. Using our examples above, we can time how long membership testing takes (and can see that the set-based lookup is more than 2000 times faster than the list-based version):

In [3]: %timeit 9800 in set1
40 ns ± 0.0594 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit 9800 in list1
91.4 μs ± 280 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For dictionary datatypes, membership testing determines whether a specified object is in the set of keys of the dictionary, via an expression like some_key in my_dictionary. One could alternatively inquire whether a specified key is in the dictionary by explicitly testing membership in the dictionary's keys, via some_key in my_dictionary.keys(). But this is slower than the first because it involves the temporary creation of the set of dictionary keys: implemented as a list in Python 2 (which is very slow with O(n) lookup), and as a dict_keys object in Python 3 (which appear to have O(1) lookup, but not as fast as direct testing via some_key in my_dictionary). The dictionary knows how to compute efficiently whether it has a key, so just ask it directly without trying to dictate how that computation should be done. Note also that in Python 2, dictionaries have a has_key method, but that has been removed in Python 3 in favor of the test key in dict.

To summarize:

some_object in my_set              # fast: O(1) for sets
some_object in my_list             # slow: O(n) for lists
some_key in my_dictionary          # best way to test key membership in dict
some_key in my_dictionary.keys()   # same result but slower -- much slower in Py 2

In addition to the built-in containers discussed above, the collections module in the Standard Library provides several other containers useful for other applications (reproduced here from the documentation for the module):

`namedtuple()`	factory function for creating tuple subclasses with named fields
`deque`	list-like container with fast appends and pops on either end
`ChainMap`	dict-like class for creating a single view of multiple mappings
`deque`	list-like container with fast appends and pops on either end
`ChainMap`	dict-like class for creating a single view of multiple mappings
`Counter`	dict subclass for counting hashable objects
`OrderedDict`	dict subclass that remembers the order entries were added
`defaultdict`	dict subclass that calls a factory function to supply missing values
`UserDict`	wrapper around dictionary objects for easier dict subclassing
`UserList`	wrapper around list objects for easier list subclassing
`UserString`	wrapper around string objects for easier string subclassing

Itertools

While most iteration over containers is of the garden variety (for-loops and while-loops), sometimes more exotic forms of iteration are required, or perhaps combinations of iterator patterns. Rather than constructing these on your own, it might be preferable to use one or more items from the itertools module in the Standard Library. The module standardizes a core set of fast, memory efficient tools that are useful by themselves or in combination, forming a type of "iterator algebra" that enables complex yet efficient iteration protocols in pure Python. As a partial list, this includes operations for counting, accumulating, chaining, grouping, filtering, slicing, permuting and combining iterables. (Some of these same sorts of iteration protocols are also provided in conjunction with high-performance datatypes found in third-party libraries.)

Back