Dabeaz

Tuesday, January 26, 2010

Reexamining Python 3 Text I/O

Note: Since I first posted this, I added a performance test using the Python 2.6.4 codecs module. This addition is highlighted in red.

When Python 3.0 was first released, I tried it out on a few things and walked away unimpressed. By far, the big negative was the horrible I/O performance. For instance, scripts to perform simple data analysis tasks like processing a web server log file were running more than 30 times slower than Python 2. Even though there were many new features of Python 3 to be excited about, the I/O performance alone was enough to make me not want to use it---or recommend it to anyone else for that matter.

Some time has passed since then. For example, Python-3.1.1 is out and many improvements have been made. To force myself to better understand the new Python 3 I/O system, I've been working on a tutorial Mastering Python 3 I/O for the upcoming PyCON'2010 conference in Atlanta. Overall, I have to say that I'm pretty impressed with what I've found--and not just in terms of improved performance.

Due to space constraints, I can't talk about everything in my tutorial here. However, I thought I would share some thoughts about text-based I/O in Python 3.1 and discuss a few examples. Just as a disclaimer, I show a few benchmarks, but my intent is not to do a full study of every possible aspect of text I/O handling. I would strongly advise you to download Python 3.1.1 and perform your own tests to get a better feel for it.

Like many people, one of my main uses of Python is data processing and parsing. For example, consider the contents of a typical Apache web server log:

Let's look at a simple script that processes this file. For example, suppose you wanted to produce a list of all URLs that have generated a 404 error. Here's a really simple (albeit hacky) script that does that:

error_404_urls = set()
for line in open("access-log"):
fields = line.split()
if fields[-2] == '404':
error_404_urls.add(fields[-4])
for name in error_404_urls:
print(name)

On my machine, I have a 325MB log file consisting of 3649000 lines--a perfect candidate for performing a few benchmarks. Here are the numbers that you get running the above script with different Python versions. UCS-2 refers to Python compiled with 16-bit Unicode characters. UCS-4 refers to Python compiled with 32-bit Unicode characters (the --with-wide-unicode configuration option). Also, in the interest of full disclosure, these tests were performed with a warm disk cache on a 2 GHZ Intel Core 2 Duo Apple Macbook with 4GB of memory under OS-X 10.6.2 (Snow Leopard).

Python Version

Time (seconds)

2.6.4

7.91s

3.0

125.42s

3.1.1 (UCS-2)

14.11s

3.1.1 (UCs-4)

17.32s

As you can see, Python 3.0 performance was an anomaly--the performance of Python 3.1.1 is substantially better. To better understand the I/O component of this script, I ran a modified test with the following code

for line in open("access-log"):
pass

Here are the performance results for iterating over the file by lines:

Python Version

Time (seconds)

2.6.4

1.50s

2.6.4 (codecs, UTF-8)

52.22s

3.0

105.87s

3.1.1 (UCS-2)

4.35s

3.1.1 (UCs-4)

6.11s

If you look at these numbers, you will see that the I/O performance of Python 3.1 has improved substantially. It is also substantially faster than using the codecs module in Python 2.6. However, you'll also observe that the performance is still quite a bit worse than the native Python 2.6 file object. For example, in the table, iterating over lines is about 3x slower in Python 3.1.1 (UCS-2). How can that be good? That's 300% slower!

Let's talk about the numbers in more detail. The decreased performance in Python 3 is almost solely due to the overhead of the underlying Unicode conversion applied to text input. That conversion process involves two distinct steps:

Input data (bytes) has to be scanned and characters decoded according to some encoding (UTF-8 by default).

The decoded character data has to be stored as an array of multibyte integers that represent the associated string result.

The overhead of decoding is a direct function of how complicated the underlying codec is. Although UTF-8 is relatively simple, it's still more complex than an encoding such as Latin-1. Let's see what happens if we try reading the file with "latin-1" encoding instead. Here's the modified test code:

for line in open("access-log",encoding='latin-1'):
pass

Here are the modified performance results that show an improvement:

Python Version

Time (seconds)

3.1.1 (UCS-2)

3.64s (was 4.35s)

3.1.1 (UCs-4)

5.33s (was 6.11s)

Lesson learned : The encoding matters. So, if you're working purely with ASCII text, specifying an encoding such as 'latin-1' will speed everything up. Just so you know, if you specify 'ascii' encoding, you get no improvement over UTF-8. This is because 'ascii' requires more work to decode than 'latin-1' (due to an extra check for bytes outside the range 0-127 in the decoding process).

At this point, you're still saying that it's slower. Yes, even with a faster encoding, Python 3.1.1 is still about 2.5x slower than Python 2.6.4 on this simple I/O test. Is there anything that can be done about that?

The short answer is "not really." Since Python 3 strings are Unicode, the process of reading a simple 8-bit text file is always going to involve an extra process of converting and copying the byte-oriented data into the multibyte Unicode representation. Just to give you an idea, let's drop into C programming and consider the following program:

This program does nothing more than iterate over lines of a file--think of it as the ultimate stripped down version of our Python-2.6.4 test. If you run it, takes 1.13s to run on the same log file used for our earlier Python tests.

When you go to Python 3, there is always extra conversion. It's like modifying the C program as follows:

Sure enough, if you run this modified C program, it takes about 1.7 seconds--a nearly 50% performance hit just from that extra copying and conversion step. Minimally, Python 3 has to do the same conversion. However, it's also performing dynamic memory allocation, reference counting, and other low-level operations. So, if you factor all of that in, the performance numbers start to make a little more sense. You also start to understand why it might be really hard to do much better.

Now, should you care about all of this? Truthfully, most programs are probably not going to be affected by degraded text I/O performance as much as you think. That's because most interesting programs do far more than just I/O. Go back and consider the original script that I presented. On Python-2.6.4, it took 7.91s to execute. If I go back and tune the script to use the more efficient 'latin-1' encoding, it takes 13.8s with Python-3.1.1. Yes, that's about 1.75x slower than before. However, the key point is that it's not 2.5x slower as our earlier I/O tests would suggest. The performance impact will become less and less as the script performs more non-IO related work.

Finally, let's say that you still can't live with the performance degradation. If you're just working with simple ASCII data files, you might solve this problem by turning to binary I/O instead. For example, the following script variant uses binary I/O and bytes for most of its processing--only converting text to Unicode when absolutely necessary for printing.

error_404_urls = set()
for line in open("access-log","rb"):
fields = line.split()
if fields[-2] == b'404':
error_404_urls.add(fields[-4])
for name in error_404_urls:
print(name.decode('latin-1'))

If you run this final script, you find that it takes 8.22s in Python 3.1.1--which is only about 4% slower than the Python-2.6.4. How about that!

The bottom line is that Python-3.1 is definitely worth a second look--especially if you tried the earlier Python 3.0 release and were disappointed with its performance. Although text-based I/O is always going to be slower in Python 3 due to extra Unicode processing, it might not matter as much in practice. Plus, binary I/O in Python 3 is still quite fast which means that you can turn to it as a last resort.