Launching External Processes in Python

In past articles, I've looked into concurrency in Python via threads (see "Thinking Concurrently: How Modern Network Applications Handle Multiple Connections" and "Threading in Python"). The good news with threads is that they are relatively easy to work with and let you share data among threads without too much trouble. The bad news is that if you're not careful, you can end up with serious problems, precisely because data is shared and Python data structures aren't thread-safe. But perhaps a bigger problem is that Python's global interpreter lock (GIL) guarantees that only one thread runs at a time.

In many cases, this isn't really a problem. In particular, if you're writing programs that work with the filesystem or network, you probably won't feel the pain of Python threads too badly. That's because while only one thread runs at a time, a thread gives up control of the CPU whenever it uses I/O. This is because disks and networks are many times slower than CPUs; while you're waiting for the filesystem to give you the data you've requested, another thread can be running.

That said, there definitely are times when Python's threads show their limitations. In particular, if you're writing code that is CPU-bound—that is, in which the CPU is the bottleneck—you'll find that threads are limited. After all, if you have a nice 48-core machine with which to play, doesn't it seem silly to have only one of those cores actually doing something?

There is, of course, a solution to these problems—one that many traditional UNIX users consider to be superior under many circumstances: processes. Rather than run a function in a new thread, run it in a new process!

So in this article, I take an initial look at working with processes in Python to do a very common task: invoking external commands. In so doing, I also cover how working with processes is structured, leading to my next article's topic: the "multiprocessing" module.

Process Basics

For Linux users, nothing is more basic and everyday than a process. When I fire up Emacs, I start a process. When I start the Apache HTTP server, I start a process, which then starts multiple, additional processes. When I invoke ls on the command line, I'm starting a process. And when I tell my computer to shut down, it does so by killing each of those processes.

Think of a process as a data structure that represents a computer at a particular moment in time. A process has code that is running (including code that has yet to run); it has data on which the program works; it has access to memory to store and retrieve additional data, and it can talk to external devices, from filesystems and networks to keyboards and screens.

A single Linux machine can run many, many processes at once. For the split second during which a process runs, it has the illusion of having complete control over the computer. It's thanks to the fact that modern computers are so fast that you can run so many processes and yet have them all appear to be running concurrently. True, modern computers have multiple CPUs (aka "cores"), which lets you divide the work among those cores.

There are all sorts of ways to start processes in Python. In modern versions of the language, you can use the "subprocess" module to start up a process and even retrieve the result. For example, you can invoke the ls program in a new process and then view the results:

>>> import subprocess
>>> subprocess.check_output('ls')

From this function, you get back the output from the ls command. It's a bit ugly to look at, especially if you're used to seeing things printed nicely. In such a case, you don't want to view the value that was returned, but rather to print it. The thing is, that doesn't seem to work, at least not in Python 3:

>>> print(subprocess.check_output('ls'))

The problem is that, by default, subprocess.check_output returns a "bytestring", similar to a Python 2 string, in that it contains a sequence of bytes, rather than a sequence of Unicode characters. The issue here is that when you print a bytestring, Python shows its representation, b'...', in which each newline appears as the two characters \n rather than as an actual line break.
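Here's a minimal sketch of the difference, assuming a Linux-like system with ls on the PATH. The utf-8 encoding passed to decode is my assumption, not something the original call specifies; adjust it for your locale.

```python
import subprocess

# Without any text-mode option, check_output hands back bytes.
raw = subprocess.check_output('ls')
print(type(raw))   # <class 'bytes'>
print(raw)         # printed as b'...' with newlines shown as \n

# Decoding turns the bytes into a str, which print displays line by line.
# The encoding here is an assumption; errors='replace' avoids a crash on
# file names that are not valid in that encoding.
listing = raw.decode('utf-8', errors='replace')
print(listing)
```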

You can get around this problem by telling Python to interpret newline characters liberally and to return a string instead of a bytestring:

>>> print(subprocess.check_output('ls', universal_newlines=True))
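Once the result is a real string, the usual string methods apply. The sketch below, which again assumes ls is available, splits the output into one file name per line. In Python 3.7 and later, text=True is accepted as a more readable alias for universal_newlines=True; I stick with the older spelling here so the example also runs on earlier versions.

```python
import subprocess

# universal_newlines=True asks for a str instead of bytes.
out = subprocess.check_output('ls', universal_newlines=True)
print(type(out))   # <class 'str'>

# One entry per line of ls output.
names = out.splitlines()
for name in names:
    print('found:', name)
```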

This seems to work quite nicely. But what if you want to pass options to ls, or print only a subset of the files in the current directory? It seems natural to start by saying, for example, ls -l. Let's try that:

Passing 'ls -l' as a single string like this fails with a FileNotFoundError. What's wrong here? Very simply, Python is trying to run an external process, giving Linux the whole string ls -l as the name of the command. You might think that this is normal and reasonable, since running ls -l is something you likely do all the time. But remember that ls is the command, and -l is a flag to that command. You can understand the difference, and the shell typically separates them for you. But if you simply hand that string to Linux as a command name, it looks for a program literally called "ls -l", doesn't find one, and complains.
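The failure can be reproduced with a sketch like the following. The try/except wrapper and the error variable are my additions, there only so that the exception is displayed rather than ending the program.

```python
import subprocess

error = None
try:
    # The whole string is treated as the program name, spaces and all.
    subprocess.check_output('ls -l', universal_newlines=True)
except FileNotFoundError as e:
    error = e
    print('Could not start the process:', e)
```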

So instead of passing a single string, you'll need to pass a list of strings, in which each represents a "word" of the command. For example:
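A minimal sketch of the list form, assuming ls is on the PATH. The shlex.split call is an optional convenience from the standard library that I've added to show how a command string can be broken into words the way a shell would do it.

```python
import shlex
import subprocess

# Each "word" of the command is a separate list element.
output = subprocess.check_output(['ls', '-l'], universal_newlines=True)
print(output)

# shlex.split produces the same list from a single string.
words = shlex.split('ls -l')
print(words)   # ['ls', '-l']
```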

Passing the list ['ls', '-l'] to check_output works as expected. But if you then add "*.txt" as a third element, hoping to list only the text files, ls complains that "*.txt" isn't a legitimate file. That's because while you might think that Linux always knows that * represents all of the files in a directory, that's not the case. It is the shell that performs the interpretation of such characters as "*", dividing things up and then passing them along to the underlying operating system.
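To see the complaint for yourself, here is a sketch that runs ls in a throwaway directory. The directory and the file name notes.txt are hypothetical, created only for this demonstration. Even though a matching file exists, ls receives the literal string *.txt and exits with a nonzero status, which check_output reports as CalledProcessError.

```python
import pathlib
import subprocess
import tempfile

failure = None
with tempfile.TemporaryDirectory() as workdir:
    # A file that *.txt "should" match, if anything expanded the pattern.
    pathlib.Path(workdir, 'notes.txt').write_text('hello\n')
    try:
        subprocess.check_output(['ls', '-l', '*.txt'], cwd=workdir,
                                stderr=subprocess.PIPE,
                                universal_newlines=True)
    except subprocess.CalledProcessError as e:
        failure = e
        print('exit status:', e.returncode)
        print('message from ls:', e.stderr)
```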

So, how can you list all of the files with a "*.txt" suffix? You can invoke the same call once again, but tell Python to pass the parameters through the UNIX shell:
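A sketch of such a call follows. It passes the command as one string along with shell=True, so that the shell expands the pattern. As before, the temporary directory and the two file names are hypothetical, set up only so the result is predictable.

```python
import pathlib
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    pathlib.Path(workdir, 'notes.txt').write_text('hello\n')
    pathlib.Path(workdir, 'image.png').write_bytes(b'')

    # With shell=True, the string goes to the shell, which expands *.txt
    # before ls ever runs.
    listing = subprocess.check_output('ls -l *.txt', shell=True,
                                      cwd=workdir,
                                      universal_newlines=True)

print(listing)
```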

So, what happened here? This started a new process (a "subprocess", if you will), and in that process, executed a UNIX program. The program produced some text, which Python captured and then printed.

The Python documentation makes it clear that having shell=True in your call to subprocess.check_output (and other functions) is a potential security risk. If you're getting input from an unknown or untrusted user, that person can insert arbitrary commands into the system on which check_output is running. Be sure to consider the security implications of shell=True before using it.
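One way to sidestep the risk, offered here as a suggestion rather than as the only approach, is to let Python expand the pattern with the standard glob module and then pass the resulting names as list elements, so that no shell is involved at all. The directory and file name below are hypothetical. If you truly need the shell, the standard library's shlex.quote can escape an individual argument.

```python
import glob
import os
import pathlib
import subprocess
import tempfile

safe_listing = ''
with tempfile.TemporaryDirectory() as workdir:
    pathlib.Path(workdir, 'notes.txt').write_text('hello\n')

    # Python, not the shell, turns the pattern into a list of paths.
    matches = sorted(glob.glob(os.path.join(workdir, '*.txt')))

    # Only call ls if something matched; otherwise it would list the
    # current directory instead.
    if matches:
        safe_listing = subprocess.check_output(['ls', '-l'] + matches,
                                               universal_newlines=True)

print(safe_listing)
```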

More Generally

subprocess.check_output is a specific function, one that's designed to run a program and retrieve its output. If you want a bit more flexibility, you can run other functions from "subprocess".

For example, let's say you want to take the output from ls and put it into a file. On the UNIX command line, you could say:

ls -l > file-list.txt

In Python, this is a bit more complex, but not terribly so if you use subprocess.run. This function is new (as of Python 3.5), but it makes life a bit easier.

You can try this:

>>> subprocess.run(['/bin/ls', '-l'], universal_newlines=True)

As you can see, subprocess.run takes many similar arguments to subprocess.check_output. But what's different is that it doesn't return a string, even when universal_newlines is set to True. Instead, it returns an instance of subprocess.CompletedProcess, which contains all sorts of information about the process that ran.
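A small sketch that keeps the return value and inspects it, assuming /bin/ls exists on your system. The variable name cp is simply shorthand for the CompletedProcess instance.

```python
import subprocess

# The listing itself goes straight to the terminal, because nothing
# tells run to capture it.
cp = subprocess.run(['/bin/ls', '-l'], universal_newlines=True)

print(cp.args)         # the command that was run
print(cp.returncode)   # 0 on success
print(cp.stdout)       # None, since stdout was not captured
```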

Hmm, that's likely not quite what you wanted. The CompletedProcess instance you get back shows args, which is fine, and returncode, which is accurately showing 0, meaning that everything ended just fine. But what happened to the output? It went straight to the terminal rather than into the returned object. The answer is that when it comes to subprocess.run, you need to indicate where the output should go.

The way to indicate that you want to get something back is to pass subprocess.PIPE as the value of the stdout keyword argument:
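A sketch of that call, again assuming /bin/ls exists. In Python 3.7 and later, capture_output=True is a shortcut that sets both stdout and stderr to subprocess.PIPE; I use the explicit form here because it also works on 3.5 and 3.6.

```python
import subprocess

cp = subprocess.run(['/bin/ls', '-l'],
                    stdout=subprocess.PIPE,
                    universal_newlines=True)

print(cp.returncode)
# cp.stdout now holds the whole listing as one string.
print(cp.stdout)
```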

I'm not going to show you the full result, because it's so long, but the stdout value is now precisely what you wanted: the text that ls printed.

You also can assign stderr to subprocess.PIPE in order to receive it. Note that in the case of both stdout and stderr, you can assign not just subprocess.PIPE, which lets you grab and work with the program's output, but also an open (writable) file object. This means you can invoke an external process and put its output into an arbitrary file. I'd argue that most of the time, the reason you would be executing an external process in Python is that you want to do something to the text, but this will work.
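Putting those pieces together gives a rough Python equivalent of ls -l > file-list.txt. In this sketch, the temporary directory is my own addition so that the example cleans up after itself; in real use you would open file-list.txt wherever you want it.

```python
import os
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, 'file-list.txt')

    # stdout goes to the open file; stderr is captured through a pipe.
    with open(target, 'w') as outfile:
        cp = subprocess.run(['/bin/ls', '-l'],
                            stdout=outfile,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)

    # Read the file back to confirm what the external process wrote.
    with open(target) as infile:
        saved = infile.read()

print(cp.returncode)
print(saved)
```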

You might be wondering whether you can not only capture what the program writes to stdout and stderr, but also supply what it reads from stdin. And the answer is: definitely. Just provide a file object opened for reading, and subprocess.run will do the rest. For example:

In this case, you run /bin/cat with the -n option, numbering the lines of a file. What's the input file? /etc/passwd. And where does the output go? To a pipe, requested by passing subprocess.PIPE, which is a kind of communication channel to external processes.
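The call described above might be sketched as follows. It assumes a system where both /bin/cat and /etc/passwd exist, as they do on a typical Linux machine.

```python
import subprocess

# The open file becomes the external program's standard input.
with open('/etc/passwd') as passwd:
    cp = subprocess.run(['/bin/cat', '-n'],
                        stdin=passwd,
                        stdout=subprocess.PIPE,
                        universal_newlines=True)

print(cp.returncode)
print(cp.stdout)   # the contents of /etc/passwd, with line numbers
```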

For me, the most interesting thing is the CompletedProcess object (cp), from which you can grab different pieces of information about the completed process. Note that subprocess.run will return only after the external program has finished running, at which point the cp variable will be set. And from there, you can grab stdout, which is normally a bytestring, but which is an actual (Unicode) string if you set universal_newlines to True.

Conclusion

You've now seen how you can use the "subprocess" module to communicate with external processes. But let's face it. This doesn't exactly solve the initial problem: breaking a problem up and using different processes to handle it. Rather, this shows, at some level, how Python works with processes and the basic ways in which it communicates with them, using bytestrings and pipes. That's because processes are separate and cannot simply share variables with the main thread, which is what you're doing when using threads.

Reuven Lerner teaches Python, data science and Git to companies
around the world. You can subscribe to his free, weekly "better
developers" e-mail list, and learn from his books and courses at
http://lerner.co.il. Reuven lives with his wife and children in
Modi'in, Israel.