Cornell Virtual Workshop > Python for High Performance > Writing Faster Python

Memory Management

While we are primarily focused here on strategies to improve program run times, performance also is impacted by usage of processor memory. Within the Python interpreter, objects are garbage collected, which means the interpreter only frees memory after the program no longer holds a reference to that memory. This frees the programmer from having to deal with many of the memory allocation and deallocation issues that arise in other languages, but can also make the programmer blissfully unaware of a bloated memory footprint. Internally, the Python interpreter has a rather complicated memory management system. While it might return some of the freed heap memory to the OS, much of the freed memory remains allocated by the interpreter, waiting to be reused once new objects are created. If your application is memory bound, there are a few resources and strategies you can use.

Tools for monitoring memory usage

Third-party libraries, such as memory_profiler, provide various functions for monitoring memory usage. In addition, the sys and resources modules in the standard library provide some support for monitoring the memory footprint of particular objects or the interpreter as a whole. In particular:

sys.getsizeof(object) returns the size in bytes of the specified object
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss returns the memory size in bytes currently allocated by the interpreter

Strategies for addressing the performance impacts of memory usage

In the section on Lazy Evaluation, we introduced an important strategy for minimizing the unnecessary creation of large containers whose only purpose is to be iterated over one element at a time.
Python and its associated libraries provide a number of convenient interfaces for dynamically constructing containers and other datatypes, which can be very useful when the ultimate size of the container is unknown at the outset. Appending elements to a list in a for-loop is one widely encountered example. Internally, the Python interpreter allocates memory for such objects in chunks, and only needs to allocate more memory once the object outgrows the current allocation. But this can lead to memory fragmentation and associated performance hits to accessing all the elements of the container efficiently. A NumPy array, on the other hand, allocates all the memory it needs in a chunk for its prescribed size. So if the ultimate size of a container is known, it can sometimes be useful to pre-allocate an object of that size and then fill it in. For NumPy arrays, the numpy.zeros() function (and, with some caution, the numpy.empty() function) can be used for such pre-allocation. The internal management of memory also impacts the performance of different container operations, and sometimes might imply a different choice of containers based on memory access. With regard to Python lists, for example, the Time Complexity page notes: "Internally, a list is represented as an array; the largest costs come from growing beyond the current allocation size (because everything must move), or from inserting or deleting somewhere near the beginning (because everything after that must move). If you need to add/remove at both ends, consider using a collections.deque instead."
While the memory footprint of the Python interpreter generally continues to grow as more and more objects are created (see more about this below), some of this memory allocation can be walled off by creating a separate process, the memory for which will be returned to the OS after the process terminates. This can be achieved, for example, using the multiprocessing module to start up a separate process to carry out some particular set of tasks. This strategy might be useful if a lot of memory is needed to preprocess a set of input data, which will then be distilled to a smaller set for further analyses.
In assignment statements such as y = f(x), the right-hand side of an assignment is first evaluated (f(x)), and then assigned to the variable on the left-hand side (y). If the variable y had previously been assigned to a different object, it will be unbound from that object upon assignment to f(x), and the memory associated with the previously bound object will be garbage collected (assuming there are no other remaining references to it). But that means the interpreter must first allocate memory for the object created by f(x) before it can free the memory associated with the object previously bound to y. In many applications, we overwrite a variable name within a loop, as in this somewhat contrived example:

import numpy as np

number_of_timesteps = 3

a_current = np.random.random((1000,1000))
for n in range(number_of_timesteps):
    a_next = a_current.copy()
    np.random.shuffle(a_next)  # shuffles in place
    a_diff = a_next - a_current
    # do some further analysis with a_diff

Clearly, because we are creating three arrays — a_current, a_next, and a_diff — we need to pay the memory price for each of those. Since each of those is a 1000x1000 array of floats, we would expect each to be approximately 1000 x 1000 x 8 bytes = 8000000 bytes = 8 MB in size. But now consider this instrumented version of the code (using the resource module described above to monitor the growth of the memory footprint):

import numpy as np
import resource

m0 = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

number_of_timesteps = 3

a_current = np.random.random((1000,1000))
a_next = None
a_diff = None
print(1, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss - m0)
for n in range(number_of_timesteps):
    a_next = a_current.copy()
    np.random.shuffle(a_next)  # shuffles in place
    print(n, 2, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss - m0)
    a_diff = a_next - a_current
    print(n, 3, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss - m0)

When we run this code (in a clean interpreter session), we get output like the following:

1 8011776
0 2 16068608
0 3 24092672
1 2 32100352
1 3 32100352
2 2 32100352
2 3 32100352

On the initial creation of a_current, we pay the memory cost of 8011776 bytes. In the first pass through the loop, we pay an additional cost of another 8 MB or so to create a_next as a shuffled copy of a_current, and then a third chunk of 8 MB to create a_diff. But on the next pass through the loop (n=1), we incur a fourth memory hit of 8 MB bytes after we update a_next. This is because the right-hand side (a_current.copy()) allocates 8 MB, and then assigns that object to a_next. The memory formerly attached to a_next is returned to the memory pool, but the overall pool size has now grown to approximately 32 MB. At that point, the memory footprint remains stable because the interpreter has a free 8 MB to use to allocate new instances of a_next and a_diff.

Consider, instead, this alternate version of the code, in which we pre-delete a_next and a_diff prior to their assignment:

import numpy as np
import resource

m0 = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

number_of_timesteps = 3

a_current = np.random.random((1000,1000))
a_next = None
a_diff = None
print(1, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss - m0)
for n in range(number_of_timesteps):
    del a_next
    a_next = a_current.copy()
    np.random.shuffle(a_next)  # shuffles in place
    print(n, 2, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss - m0)
    del a_diff
    a_diff = a_next - a_current
    print(n, 3, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss - m0)

When we run this code (in a clean interpreter session), we get output like the following:

1 8011776
0 2 16068608
0 3 24092672
1 2 24096768
1 3 24096768
2 2 24096768
2 3 24096768

We see now that we allocate only 3 of the 8-MB chunks. This is because we are freeing up the memory attached to a_next and a_diff prior to their being reassigned, giving the interpreter the free 8 MB it needs in each instance in order to create the array on the right-hand side, without requiring the interpreter to allocate even more memory. And while this example is contrived and only results in a memory savings of 8 MB, in a real application memory inefficiencies of this sort can add up in various parts of a program. This is especially true if existing objects tend to be replaced with larger ones; then the interpreter may be unable to reuse the garbage-collected memory, and memory usage can expand quickly due to fragmentation.

The previous discussion assumes that an object is being created and returned by a function, and then being assigned to a variable. Python functions can manipulate objects that are input as arguments, so in some cases it might be preferable to pre-allocate an object that will be manipulated in place by a function that returns nothing, rather than dealing with the memory allocation issues discussed above.

Back