Threading in Python

In my last article, I took a short tour through the ways you can add
concurrency to your programs. In this article, I focus on one of those forms
that has a reputation for being particularly frustrating for many
developers: threading. I explore the ways you can use threads in
Python and the limitations the language puts upon you when doing so.

The basic idea behind threading is a simple one: just as the computer can run
more than one process at a time, so too can your process run more than one
thread at a time. When you want your program to do something in the background,
you can launch a new thread. The main thread continues to run in the
foreground, allowing the program to do two (or more) things at once.

What's the difference between launching a new process and a new thread? A new
process is completely independent of your existing process, giving you more
stability (in that the processes cannot affect or corrupt one another) but
also less flexibility (in that data cannot easily flow from one process to
another). Because multiple threads within a process share data, they can work
with one another more closely and easily.

For example, let's say you want to retrieve all of the data from a variety
of websites. My preferred Python package for retrieving data from the web is
the "requests" package, available from PyPI. Thus, I can use a
for loop, as follows:
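A minimal sketch of such a loop might look like this. So that it runs without network access, the download is simulated: fake_fetch is a hypothetical stand-in that sleeps briefly and returns made-up bytes, and the example.com URLs are placeholders. With the "requests" package installed, the marked line would call requests.get(one_url).content instead.

```python
import time

def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url).content."""
    time.sleep(0.05)            # pretend the network is slow
    return b"x" * len(url)      # made-up "page content"

# Placeholder URLs; the real program would read its own list.
urls = ["https://example.com/a", "https://example.com/bb"]

length = {}
for one_url in urls:
    content = fake_fetch(one_url)   # real code: requests.get(one_url).content
    length[one_url] = len(content)

for one_url, one_length in length.items():
    print(f"{one_url}: {one_length}")
```

Each download has to finish before the next one starts, so the total time is roughly the sum of the individual retrieval times.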

How does this program work? It goes through a list of URLs (as strings), one
by one, retrieving each URL's content, calculating the length of that content
and then storing the length inside a dictionary called length. The keys in
length are URLs, and the values are the lengths of the requested URL content.

So far, so good; I've turned this into a complete program
(retrieve1.py), which is shown in Listing 1. I put nine URLs into a text
file called urls.txt (Listing 2), and then timed how long
retrieving each of them took. On my computer, the total time was
about 15 seconds, although there was clearly some variation in the
timing.

Improving the Timing with Threads

How can I improve the timing? Well, Python provides threading. Many
people think of Python's threads as fatally flawed, because only one
thread actually can execute Python code at a time, thanks to the GIL (global
interpreter lock). That limitation is real if you're running a program that is
performing serious calculations, and in which you really want the
system to be using multiple CPUs in parallel.

However, I have a different sort of use case here. I'm interested
in retrieving data from different websites. Python knows that I/O
can take a long time, and so whenever a Python thread engages in I/O
(that is, the screen, disk or network), it gives up control and hands
use of the GIL over to a different thread.
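The effect can be seen in a small experiment. In this sketch, time.sleep stands in for a slow I/O call (an assumption made so the example needs no network; like blocking I/O, sleeping releases the GIL), and the function name wait_for_io is hypothetical. The start and join calls simply launch the threads and wait for them to finish.

```python
import threading
import time

def wait_for_io(seconds):
    # time.sleep releases the GIL while waiting, much as blocking I/O does
    time.sleep(seconds)

start = time.perf_counter()

# Five threads, each "waiting on I/O" for 0.2 seconds
threads = [threading.Thread(target=wait_for_io, args=(0.2,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()    # wait for each thread to finish

elapsed = time.perf_counter() - start
print(f"Five 0.2-second waits took {elapsed:.2f} seconds in total")
```

Because the waits overlap, the total should be close to 0.2 seconds rather than the full second that five back-to-back waits would take.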

In the case of my "retrieve" program, this is perfect. I can spawn a
separate thread to retrieve each of the URLs in the list. I then
can wait for the URLs to be retrieved in parallel, checking in with each
of the threads one at a time. In this way, I probably can save time.

Let's start with the core of my rewritten program. I'll want to
implement the retrieval as a function that takes one argument: the URL
I want to retrieve. I then can invoke that function by creating a new
instance of threading.Thread, telling the new instance not only which
function I want to run in a new thread, but also which argument(s) I
want to pass. This is how that code will look:
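The following is a sketch rather than the original listing. The body of get_length is a placeholder that simulates a download with time.sleep, and the example.com URLs are assumptions for illustration; only the threading.Thread call with its target and args parameters is the point here.

```python
import threading
import time

def get_length(one_url):
    # Hypothetical stand-in for the download; real code might call
    # requests.get(one_url).content here.
    time.sleep(0.05)
    content = b"x" * len(one_url)
    # The length is computed, but nothing hands it back to the main thread yet.
    print(f"{one_url}: {len(content)}")

urls = ["https://example.com/a", "https://example.com/bb"]  # placeholder URLs

for one_url in urls:
    # args is given as a tuple, hence the trailing comma for a single argument
    t = threading.Thread(target=get_length, args=(one_url,))
    t.start()
```

Calling start returns immediately; the function itself runs in the new thread while the loop moves on to the next URL.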

But wait. How will the get_length function communicate the content
length to the rest of the program? In a threaded program, it's risky
to have individual threads modify shared built-in data structures,
such as a list, directly. Although some single operations happen to be
atomic in CPython, compound operations (such as checking a value and
then updating it) are not, and relying on such details might
cause all sorts of problems.

However, you can use a queue (queue.Queue, in Python's standard library),
which is thread-safe and thus provides a reliable form of communication. The function can put
its results on the queue, and then, when all of the threads have
completed their run, you can read those results from the queue.
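A sketch of that arrangement, using queue.Queue from the standard library, could look like the following. As assumptions for illustration, fake_fetch is a hypothetical stand-in for requests.get(url).content, and the URLs are placeholders.

```python
import queue
import threading
import time

def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url).content."""
    time.sleep(0.05)
    return b"x" * len(url)

q = queue.Queue()   # thread-safe: many threads may call put() at once

def get_length(one_url):
    content = fake_fetch(one_url)
    # Put a (URL, length) tuple on the queue for the main thread to read later
    q.put((one_url, len(content)))

urls = ["https://example.com/a", "https://example.com/bb"]  # placeholder URLs

for one_url in urls:
    t = threading.Thread(target=get_length, args=(one_url,))
    t.start()
```

Nothing is read from the queue yet; the main thread first has to be sure that the worker threads are done.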

As you can see, the function retrieves the content of one_url and
then places the URL itself, as well as the length of the content, in a
tuple. That tuple is then placed in the queue.

It's a nice little program. The main thread spawns a new
thread for each URL, and each of those threads runs get_length. In
get_length, the result gets put on the queue.

The thing is, now the main thread needs to retrieve things from the queue.
But if you do this just after launching the threads, you run the risk of
reading from the queue before the threads have completed. So, you need to
"join" the threads, which means to wait until they have finished. Once
the threads have all been joined, you can read all of their information
from the queue.

There are a few different ways to join the threads. An
easy one is to create a list where you will store the threads and
then append each new thread object to that list as you create it:
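Put together, the whole flow might be sketched as follows. As before, fake_fetch is a hypothetical stand-in for the real download and the URLs are placeholders; the list of threads, the join loop and the final read from the queue are the parts being illustrated.

```python
import queue
import threading
import time

def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url).content."""
    time.sleep(0.05)
    return b"x" * len(url)

q = queue.Queue()

def get_length(one_url):
    q.put((one_url, len(fake_fetch(one_url))))

urls = ["https://example.com/a", "https://example.com/bb"]  # placeholder URLs

threads = []
for one_url in urls:
    t = threading.Thread(target=get_length, args=(one_url,))
    threads.append(t)       # remember the thread so it can be joined later
    t.start()

for one_thread in threads:
    one_thread.join()       # blocks until this particular thread finishes

# Every thread has finished, so the queue now holds all of the results.
length = {}
while not q.empty():
    one_url, one_length = q.get()
    length[one_url] = one_length

print(length)
```

Checking q.empty() is only safe here because every producer thread has already been joined; while threads are still running, an empty queue says nothing about whether more results are on the way.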

Note that when you call one_thread.join() in this way, the call
blocks until that thread has finished. Perhaps that's not the most efficient
way to do things, but in my experiments, it still took about one second
(15 times faster) to retrieve all of the URLs.

Python threads are routinely seen as terrible and useless. But in this
case, you can see that they allowed me to parallelize the program without
too much trouble, having different sections execute concurrently.

Considerations

The good news is that this demonstrates how using threads can be
effective when you're doing numerous, time-intensive I/O actions.
This is especially good news if you're writing a server in Python that
uses threads; you can open up a new thread for each incoming request
and/or allocate each new request to an existing, pre-created thread.
Again, if the threads don't really need to execute in a truly parallel
fashion, you're fine.
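One way to hand work to a fixed set of reusable threads is the standard library's concurrent.futures.ThreadPoolExecutor. The sketch below is an illustration under assumptions, not a real server: handle_request is a hypothetical handler that simulates I/O with time.sleep, and the "requests" are just the numbers 0 through 9.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    # Hypothetical handler; the sleep stands in for I/O such as a database call
    time.sleep(0.05)
    return f"response to request {request_id}"

# At most four worker threads, reused across all ten requests
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the same order as the inputs
    responses = list(pool.map(handle_request, range(10)))

for response in responses:
    print(response)
```

Capping the pool size keeps a burst of requests from creating an unbounded number of threads; requests beyond the cap simply wait for a free worker.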

But, what if your system receives a very large number of requests? In
such a case, your threads might not be able to keep up. This is
particularly true if the code being executed in each thread is CPU-intensive.

In such a case, you don't want to use threads. A popular option (indeed,
the popular option) is to use processes. In my next article, I plan to
look at how such processes can work and interact.

Reuven Lerner teaches Python, data science and Git to companies around
the world. His free, weekly "better developers" email list reaches thousands
of developers each week; subscribe here. Reuven lives with his wife and
children in Modi'in, Israel.