While adding multithreading support to a Python script, I found myself thinking again about the difference between multithreading and multiprocessing in the context of Python.
For the uninitiated, Python multithreading uses threads to do parallel processing. This is the most common way to do parallel work in many programming languages. But CPython has the Global Interpreter Lock (GIL), which means that no two Python statements (bytecodes, strictly speaking) can execute at the same time. So this form of parallelization is only helpful if most of your threads are either not actively doing anything (for example, waiting for input) or doing something that happens outside the GIL (for example, launching a subprocess or doing a numpy calculation). Using threads is very lightweight; for example, the threads share memory space.
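To make that concrete, here is a minimal sketch (not code from my script) using the standard concurrent.futures module; the wait_for_io function is just a placeholder for any blocking call that releases the GIL:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def wait_for_io(n):
    # time.sleep releases the GIL, so the five threads overlap.
    time.sleep(1)
    return n

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(wait_for_io, range(5)))
# Roughly 1 second total, not 5, because the threads spend their time waiting.
print(f"{len(results)} tasks in {time.perf_counter() - start:.2f}s")
```

Swap the sleep for a pure-Python loop and the speedup disappears, because only one thread can hold the GIL at a time.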
Python multiprocessing, on the other hand, uses multiple system-level processes; that is, it starts up multiple instances of the Python interpreter. This gets around the GIL limitation, but obviously has more overhead. In addition, communicating between processes is not as easy as reading and writing shared memory.
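A comparable sketch with processes, again not the code from my script; count_to is just a stand-in for any pure-Python CPU-bound function:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def count_to(n):
    # A pure-Python loop holds the GIL, so threads could not run several
    # of these at once, but separate interpreter processes can.
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":          # needed: child processes re-import this module
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=5) as pool:
        # Arguments and results cross the process boundary by pickling,
        # which is part of the extra overhead mentioned above.
        results = list(pool.map(count_to, [5_000_000] * 5))
    print(f"{len(results)} sums in {time.perf_counter() - start:.2f}s")
```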
To illustrate the difference, I wrote two functions. The first is called idle and simply sleeps for two seconds. The second is called busy and computes a large sum. I ran each 15 times using 5 workers, once using threads and once using processes. Then I used matplotlib to visualize the results.
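In outline, the experiment looks something like this (the exact size of the sum inside busy and the bar-chart plotting are a simplified sketch here, not the notebook code):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import matplotlib.pyplot as plt

def idle(_):
    start = time.time()
    time.sleep(2)                      # waiting releases the GIL
    return start, time.time()

def busy(_):
    start = time.time()
    sum(range(10_000_000))             # pure-Python CPU work; size chosen arbitrarily here
    return start, time.time()

def run(task, executor_cls, n_tasks=15, n_workers=5):
    """Run task n_tasks times on n_workers and collect (start, end) wall times."""
    with executor_cls(max_workers=n_workers) as pool:
        return list(pool.map(task, range(n_tasks)))

def plot(spans, title):
    """Draw one horizontal bar per task, offset from the earliest start time."""
    t0 = min(start for start, _ in spans)
    plt.figure()
    for i, (start, end) in enumerate(spans):
        plt.barh(i, end - start, left=start - t0)
    plt.xlabel("seconds")
    plt.ylabel("task")
    plt.title(title)
    plt.show()

if __name__ == "__main__":
    for task in (idle, busy):
        for executor in (ThreadPoolExecutor, ProcessPoolExecutor):
            plot(run(task, executor), f"{task.__name__} / {executor.__name__}")
```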
Here are the two idle graphs, which look essentially identical. (Although if you look closely, you can see that the multiprocess version is slightly slower.)
And here are the two busy graphs. The threads are clearly not helping anything.
As is my custom these days, I did the computations in an IPython notebook.