Dask

Dask aims to provide familiar and compatible interfaces to core elements of the Python scientific computing ecosystem that have been extended to support distributed computations. Paraphrasing from the dask documentation, this includes:

  • providing parallelized NumPy array and Pandas DataFrame objects
  • providing a task scheduling interface for more custom workloads and integration with other projects
  • enabling distributed computing in pure Python with access to the PyData stack
  • operating with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
  • running resiliently on clusters with 1000s of cores, while installing easily on laptops and workstations

Dask is composed of two main components:

  • Dynamic task scheduling optimized for computation. A Dask scheduler can run on a single machine, backed by either a thread pool or a process pool, or can coordinate work across a distributed cluster.
  • "Big Data" collections like parallel arrays, dataframes, and lists that extend interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments.

Dask is focused largely on supporting the sorts of applications and use cases increasingly developed by the Python scientific computing community, integrating with and leveraging the power of NumPy, Pandas, Scikit-learn, and other libraries. While support for distributed NumPy arrays is good, not all NumPy functionality is available in Dask; notably, only a subset of the linear algebra routines in numpy.linalg has a Dask counterpart. Dask might therefore not be suitable for every application, but it does provide a convenient mechanism for scaling many familiar Python data types and APIs to larger, distributed platforms.
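The second component, the NumPy-like collections, can be sketched as follows. This assumes dask and numpy are installed (e.g., pip install "dask[array]"); the array sizes are arbitrary.

```python
# A sketch of the dask.array interface, which mirrors the NumPy API
# while splitting the array into chunks that can be processed in
# parallel and need not all fit in memory at once.
import dask.array as da

# A 1000x1000 array of ones, stored as 250x250 chunks.
x = da.ones((1000, 1000), chunks=(250, 250))

# Operations look like NumPy but build a lazy task graph...
y = (x + x.T).sum()

# ...which only runs when .compute() is called.
print(y.compute())  # 2000000.0
```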

Joblib

Joblib is a set of tools to provide lightweight pipelining in Python. Thus, while not focusing primarily on supporting general purpose distributed computation, it does offer support for easy, simple parallel computing, most often in the form of embarrassingly parallel for loops. Consider the following sample code and output, which computes a list comprehension in parallel with 2 jobs:
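A minimal version of such an example might look like the following sketch, assuming the joblib package is installed (pip install joblib):

```python
# An embarrassingly parallel "list comprehension" with joblib: the
# serial form would be [sqrt(i) for i in range(10)].
from math import sqrt
from joblib import Parallel, delayed

# delayed(sqrt)(i) captures the call without executing it; Parallel
# dispatches the captured calls across 2 worker jobs.
results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))
print(results[:4])  # [0.0, 1.0, 1.4142135623730951, 1.7320508075688772]
```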

By default, joblib executes tasks concurrently in separate Python worker processes, so that work can proceed on separate CPUs (recent versions use the loky backend, which builds on Python's multiprocessing machinery). Alternatively, for certain applications, users can select a threading backend. Combined with other tools for pipeline management, joblib can offer lightweight parallel support for particular types of computations.

ipyparallel

ipyparallel is a package that leverages the power of IPython to support parallel and distributed computing, enabling all types of parallel applications to be developed, executed, debugged, and monitored interactively. The ipyparallel architecture abstracts out parallelism in a general way by coordinating computational engines via a controller that schedules and manages tasks, enabling IPython to support many different styles of parallelism, including:

  • Single program, multiple data (SPMD) parallelism
  • Multiple program, multiple data (MPMD) parallelism
  • Message passing using MPI
  • Task farming
  • Data parallel computing
  • Combinations of these approaches
  • Custom user-defined approaches

With ipyparallel, the controller and each engine can run on different machines or on the same machine, and can be integrated with MPI/mpi4py calls.

Logging

If you are running Python code in parallel, or running it remotely, it helps to have good logging. The common pattern for Python's standard logging module uses a separate logger instance for each file.

We see that the recommended way to instantiate a logger object is a little unusual in that the logger is tied to the name of a specific module, most often __name__, the name of the currently active module. Somewhere during initialization of the program, you configure the logging that applies to the rest of the library or program; the tutorials show the use of basicConfig for this purpose.

The DEBUG specification says to log all events that are of "debug" level or above. To create a debug-level entry for the log, you simply call the debug method on the logger.

This is enough to log to a file. For running parallel distributed code, you can create a separate log for each process, but you will want to ensure that each of the logs has a distinct name (note that encoding only the hostname in the logfile name may not be sufficient if you have multiple processes running on the same node). You can also check the MPI rank to log selectively, for example only from the rank 0 process and (perhaps) from one of the other ranks.

If you don't use basicConfig, the process is still straightforward. Loggers send messages to Handlers, which format their output with Formatters. There is a standard configuration file format you can write, or you can set things up programmatically: hook up the Handler and Formatter, and add the Handler to the root logger.
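A sketch of the programmatic route, with an illustrative filename and format string:

```python
# Manual configuration without basicConfig: build a Handler and a
# Formatter by hand and attach them to the root logger.
import logging

handler = logging.FileHandler("manual.log")  # hypothetical filename
formatter = logging.Formatter("%(levelname)s %(name)s: %(message)s")
handler.setFormatter(formatter)

root = logging.getLogger()  # no name argument: the root logger
root.setLevel(logging.DEBUG)
root.addHandler(handler)

# Any logger in the hierarchy now propagates to the root's handler.
logging.getLogger("myapp").info("configured by hand")
```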

We just added a DatagramHandler, which sends messages over the network via UDP instead of writing them to disk. This can be a good strategy for large parallel jobs that should write fewer streams to shared disk: all of the processes can send their debugging messages to a single listener that buffers them and writes them to disk.
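The idea can be sketched as follows; here a plain UDP socket stands in for a real collector process, and the logger name is illustrative.

```python
# Logging over UDP with DatagramHandler: records go to the network
# instead of to disk.
import logging
import logging.handlers
import pickle
import socket

# Stand-in collector: bind a UDP socket on an ephemeral local port.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(5.0)
port = recv.getsockname()[1]

# Sender: route log records to the collector.
log = logging.getLogger("udp_demo")
log.setLevel(logging.DEBUG)
log.addHandler(logging.handlers.DatagramHandler("127.0.0.1", port))

log.debug("hello from a worker")

# Each datagram is a 4-byte length prefix plus a pickled record dict,
# which the collector can turn back into a LogRecord.
data, _ = recv.recvfrom(65536)
record = logging.makeLogRecord(pickle.loads(data[4:]))
print(record.getMessage())  # hello from a worker
```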

What if you want to log messages from just the edu.univ.matrix class? You can attach a handler to that specific logger and set it to a lower (more verbose) level than the rest of the program.
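A sketch of this selective configuration, reusing the edu.univ.matrix name from the text (the filename and messages are illustrative):

```python
# Verbose logging for one logger in the hierarchy while the root
# logger stays quiet.
import logging

logging.getLogger().setLevel(logging.WARNING)  # root: warnings only

matrix_log = logging.getLogger("edu.univ.matrix")
matrix_log.setLevel(logging.DEBUG)             # this logger: verbose

handler = logging.FileHandler("matrix.log")    # hypothetical filename
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
matrix_log.addHandler(handler)

matrix_log.debug("pivoting row 3")  # captured in matrix.log
logging.getLogger("edu.univ.other").debug("dropped")  # below root level
```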

Given the difficulty of tracing code running on thousands of cores, this kind of precision can be helpful.

 
©  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Inclusivity Statement