15 May 2021, Samuel Hinton

Simple Multiprocessing In Python: Comparing core vs libraries


Comparing inbuilt solutions to a range of external libraries.

In this short writeup I’ll give examples of various multiprocessing libraries, how to use them with minimal setup, and what their strengths are.

If you want a TL;DR - I recommend trying out loky for single machine tasks, check out Ray for larger tasks.

CPU bound tasks

Most of the jobs we (I) want to execute are CPU bound. All I’m doing is crunching some numbers, retraining models, over and over, and I want to spend the least amount of time twiddling my thumbs.

So let’s simulate a common task, which is evaluating some arbitrarily slow function over and over again. Here I create a function which takes some array of parameters as input, and produces a single number as output. And then I create 512 different vectors in parameter space, and want to evaluate the function of each of the 512 sets of parameters.

import numpy as np

def slow_fn(args):
    """ Simulated an optimisation problem with args coming in
    and function value being output """
    n = 10000
    y = 0
    for j in range(n):
        j = j / n
        for i, p in enumerate(args):
            y += j * (p ** (i + 1))
    return y / n

def get_jobs(num_jobs=512, num_args=5):
    """ Simulated sampling our parameter space multiple times """
    return [j for j in np.random.random((num_jobs, num_args))]

jobs = get_jobs()

# Check out this single core performance
for job in jobs:
    slow_fn(job)

On my struggling laptop, these 512 jobs took 20.30s to complete when running the functions in serial.

Pythons Concurrent.Futures

It’s best to start with some of the provided options in the standard library.

from concurrent.futures import ProcessPoolExecutor
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(slow_fn, jobs, chunksize=16));

With four workers and a chunksize of 16, this took me 6.47s. If we disable chunking (which is a bad idea for small functions like this when you have a lot of them), it takes 6.57s. Not much of a difference, but then again, 512 jobs isn’t a particularly large number. Like effectively all of the multiprocessing options provided in the standard library, this method relies on process forking and will not work inside a Jupyter notebook, instead you’ll need to throw the code into a python file and run it the good old python yourfile.py way.

Loky

Loky (from the smart cookies at joblib) is a suped up version of the Executor pool above that features reusable executors and code serialisation.

from loky import get_reusable_executor
executor = get_reusable_executor(max_workers=4)
results = list(executor.map(slow_fn, jobs, chunksize=16));

This ran in 6.31s on my machine, the fastest yet.

Loky has the benefit of distributing work by pickling the code you are trying to run and the arguments (using cloudpickle). This means it is much more flexible, and this method will run inside a Jupter notebook. Hell, it should run just about anywhere.

The benefit you get from this freedom might start to be outweighed by the fact you are adding overhead (serialising the code) to your workload, but often this can be kept to a bare minimum. And in our case, the overhead of pickling the code was less than the overhead of the process setup used by the inbuilt concurrent.futures module, so we ran things faster.

Cloudpickle will not serialise code outside of the local module, so if you want to lower the pickling overhead, just extract the function to another file (and if you’re shipping the code off to an external machine, make sure it has said file too).

To try and make sure this point is clear, the function slow_fn is defined within this file. Cloudpickle will turn all its code into bytes. If I move the function into foo.py, cloudpickle would only save out “Call foo.slow_fn” instead of the code itself. Reduce overheads, but add requirements for code to be in the right place. This overhead should be minimial unless you have a truly mammoth single function you’re exporting. Still, your choice!

MPI4PY

MPI (Mesasge Parsing Interface) is a super handy way of spreading computational load not just around on one CPU, but across multiple CPU. mpi4py is a python implementation making our lives easier. It is used commonly in super computers, where you use systems like PBS, SGE, Torque, or Slurm to request many CPUs that might be located on completely different nodes. If you are only looking at once CPU and have no plans to move off it, there are simpler methods than MPI. If you did want to use MPI, you would do something like this:

from mpi4py.futures import MPIPoolExecutor
with MPIPoolExecutor() as executor:
    results = list(executor.map(slow_fn, jobs, chunksize=16));

And you would execute the code not using python yourfile.py, but instead use mpirun or mpiexec and tell it how many cores you have. Like so:

mpiexec -n 8 python =-m mpi4py.futures yourfile.py

I’ve used MPI to distribute parallel processing loads which require minimal cross-talk. To go back to my astrophysics days, if you have 10k images of the night sky and need to process all of them, this is a great way of easily shipping the processing off to whatever CPU you can get your hands on in the supercomputers you have access to.

Just keep in mind - the message parsing part is the expensive part. Minimise the network overhead to maximise your processing speed. For local CPU tasks, this will give you the same speed as concurrent.futures.

Ray

Want to go a lot fancier and start bringing in some big guns? Ray is also great for distributing your tasks over more than one CPU, and the setup for it is also very minimal. That being said, don’t think Ray is a simple piece of code, there is a LOT in it, and it can do a lot of things (dashboards, autoscaling, model serving, and a whole bunch more).

import ray
ray.init()

workfn = ray.remote(slow_fn)
results = [workfn.remote(job) for job in jobs]
ray.get(results)

ray.shutdown();

Because this is also shipping your code elsewhere, it should run no issues in a Jupyter Notebook. Not that you’d normally want to do that, generally you’d put the ray server on a compute node somewhere, and then just connect to it, farming your jobs out.

Anyway, the principle is straightforward. I set up a server, tell it that a specific function should be executed remotely (which in this case, is still my machine, but using all my cores now), and then send it off.

All up, this took 16s to run, but of those, 4.89s was on actual computation, and the other 12s was setting up and shutting down the server.

Pretty impressive. Add onto that the fact that Ray has a ton of integrations (including some great HyperOptSearch) and I’m a fan.

Dask

If you liked the sound of Ray, you’ll probably like the sound of Dask. Similar principle with workers and a controlling node. More focus on numerical computation, and it sits behind a lot of other distributed software (like Prefect as a single example). Note that if you’re on windows, this may give you some issues, and for me versioning mismatches in the conda release made this a painful install, but hopefully this was just me getting unlucky with updating to a bugged version, and it doesn’t affect anyone else.

Once you have it installed, the rest is easy. Again, super basic usage only - making Client() set up a local client in the background isn’t something you’d productioni

from dask.distributed import Client
client = Client()
# Send off jobs
futures = client.map(slow_fn, jobs)
# Get their outputs
results = client.gather(futures)

All up this took 7.4s, with 5.8s spent on compute and the rest launching the local cluster in the background.

p_tqdm (aka Pathos + tqdm)

tqdm is not an acronym, but it is a progress bar. Pathos is a framework for heterogeneous computing. p_tqdm is a library where @swansonk14 has stuck them both together. So think more a competitor to Ray than to Loky.

from p_tqdm import p_map
results = p_map(slow_fn, jobs);
100%|██████████| 512/512 [00:05<00:00, 96.89it/s] 

I was surprised this ran faster than both concurrent.futures and loky, it came in at only 5.38s. And you even get a progress bar so that you know things are still running and progressing smoothly. Obviously in my case, I don’t really need it, but if you have a job which will take 10 hours to run, it would be great to know that its slowly chewing through the tasks and not actually hanging.

And as you saw, its ridiculously easy to set up. If you want to see the other things you can do map (blocking) vs imap (non-blocking) vs amap (async) then jump into their documentation linked above.

If you don’t care at all about the progress bar, to get stock pathos multiprocessing working, this will work:

from pathos.multiprocessing import ProcessPool
pool = ProcessPool()
results = pool.map(slow_fn, jobs, chunksize=16);

Memory Blocking

A common issue with attempts at multiprocessing on a single physical CPU is that there are many things which can bottleneck CPU execution. On many machines, memory management is done at the kernel level, which means that any malloc calls cannot be done in parallel.

First, consider our original slow_fn - it took 20 seconds to run in serial, and around 5s to run over 4 cores. This is the ideal result.

Now consider a vectorised version of our slow_fn, like so:

def slow_fn_malloc(args):
    n = 100000
    x = np.linspace(0, 1, n)
    y = np.zeros(n)
    for i, p in enumerate(args):
        y += x * (p ** (i + 1))
    return y.sum() / n

for job in jobs:
    slow_fn_malloc(job)

Now, by all accounts, this function is better. Running this took 58ms. From 20 seconds down to a twentieth of a second… Vectorisation is great.

Let me increase the number of jobs, now that we’re burning through jobs so quickly. Oh, I’ll also make n ten times larger, to increase the numerical precision of our function.

def get_many_jobs(num_jobs=4096, num_args=5):
    return [j for j in np.random.random((num_jobs, num_args))]

many_jobs = get_many_jobs()
%%time
for job in many_jobs:
    slow_fn_malloc(job)
Wall time: 3.56 s

So with even more jobs, and an increase in n, this now takes 3.56s to take. Still, very impressive considering this is all on one core now. So what happens if we try to ship it out to multiple cores? Any of the above libraries would work, I’ll just use loky:

%%time
executor = get_reusable_executor(max_workers=4)
results = list(executor.map(slow_fn_malloc, many_jobs, chunksize=16));
Wall time: 3.54 s

Wait, 3.54s… it didn’t improve at all!

The reason is fairly simple. In slow_fn_malloc the time is being taken creating the arrays, not adding them up. And because creating the arrays requires assigning memory, and thus on many operating systems (like macOS, Windows, and non-compute linux distros) is not parallelised, it doesn’t matter how many CPU cores you have, you’re just going to be waiting on memory.

Just thought I’d bring that up, in case you’re trying to debug why your execution time isn’t scaling like how you want - network, disc, and memory are the three most common bottlenecks that get in the way.

Anyway, hope this short example on how to use a bunch of different multiprocessing libraries is useful!

Here’s the full code for convenience:

from concurrent.futures import ProcessPoolExecutor
from dask.distributed import Client
from loky import get_reusable_executor
from mpi4py.futures import MPIPoolExecutor
from p_tqdm import p_map
from pathos.multiprocessing import ProcessPool
import numpy as np
import ray

# Loky, great for single machine parallelism 
executor = get_reusable_executor()
results = list(executor.map(fn, jobs))

# Ray, great for distributing over machines
ray.init()
workfn = ray.remote(fn)
results = [workfn.remote(job) for job in jobs]
ray.get(results)
ray.shutdown();

def slow_fn(args):
    """ Simulated an optimisation problem with args coming in
    and function value being output """
    n = 10000
    y = 0
    for j in range(n):
        j = j / n
        for i, p in enumerate(args):
            y += j * (p ** (i + 1))
    return y / n

def get_jobs(num_jobs=512, num_args=5):
    """ Simulated sampling our parameter space multiple times """
    return [j for j in np.random.random((num_jobs, num_args))]

jobs = get_jobs()

# Check out this single core performance
for job in jobs:
    slow_fn(job)

with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(slow_fn, jobs, chunksize=16));

executor = get_reusable_executor(max_workers=4)
results = list(executor.map(slow_fn, jobs, chunksize=16));

with MPIPoolExecutor() as executor:
    results = list(executor.map(slow_fn, jobs, chunksize=16));

ray.init()

workfn = ray.remote(slow_fn)
results = [workfn.remote(job) for job in jobs]
ray.get(results)

ray.shutdown();

client = Client()
# Send off jobs
futures = client.map(slow_fn, jobs)
# Get their outputs
results = client.gather(futures)

results = p_map(slow_fn, jobs);

pool = ProcessPool()
results = pool.map(slow_fn, jobs, chunksize=16);

def slow_fn_malloc(args):
    n = 100000
    x = np.linspace(0, 1, n)
    y = np.zeros(n)
    for i, p in enumerate(args):
        y += x * (p ** (i + 1))
    return y.sum() / n

for job in jobs:
    slow_fn_malloc(job)

def get_many_jobs(num_jobs=4096, num_args=5):
    return [j for j in np.random.random((num_jobs, num_args))]

many_jobs = get_many_jobs()

%%time
for job in many_jobs:
    slow_fn_malloc(job)

%%time
executor = get_reusable_executor(max_workers=4)
results = list(executor.map(slow_fn_malloc, many_jobs, chunksize=16));