Speedy string concatenation in Python

As many people know, one of the mantras of the Python programming language is, “There should be one, and preferably only one, obvious way to do it.” (Use “import this” in your Python interactive shell to see the full list.) However, there are often times when you could accomplish something in any of several ways. In such cases, it’s not always obvious which is the best one.

A student of mine recently e-mailed me, asking which is the most efficient way to concatenate strings in Python.

The results surprised me a bit — and gave me an opportunity to show her (and others) how to test such things. I’m far from a benchmarking expert, but I do think that what I found gives some insights into concatenation.

First of all, let’s remember that Python provides us with several ways to concatenate strings. We can use the + operator, for example:

>>> 'abc' + 'def'
'abcdef'

We can also use the % operator, which can do much more than just concatenation, but which is a legitimate option:

>>> "%s%s" % ('abc', 'def')
'abcdef'

And as I’ve mentioned in previous blog posts, we also have a more modern way to do this, with the str.format method:

>>> '{0}{1}'.format('abc', 'def')
'abcdef'

As with the % operator, str.format is far more powerful than simple concatenation requires. But I figured that this would give me some insights into the relative speeds.

Now, how do we time things? In IPython (and so in Jupyter notebooks, which run Python code through it), we can use the “%timeit” magic command to time code. I thus wrote four functions, each of which concatenates in a different way. I purposely used global variables (named “x” and “y”) to contain the original strings, and a local variable “z” in which to put the result. The result was then returned from the function. (We’ll play a bit with the values and definitions of “x” and “y” in a little bit.)

I should note that concat3 and concat4 are almost identical, in that they both use str.format. The first uses the implicit locations of the parameters, and the second uses the explicit locations. I decided that if I’m already benchmarking string concatenation, I might as well also find out if there’s any difference in speed when I give the parameters’ indexes.
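The function listing and its timings are not reproduced in this copy of the post. As a rough reconstruction from the description above, the four functions might look something like this. Two things here are assumptions: that concat2 is the % version (the text only confirms that concat1 uses + and that concat3 and concat4 use str.format), and the time_all helper, which is a hypothetical stand-in that uses the standard-library timeit module so the sketch runs in Python 3 outside IPython.

```python
import timeit

# Global inputs, as described in the post.
x = 'abc'
y = 'def'

def concat1():
    z = x + y
    return z

def concat2():
    # Assumed to be the % variant; the original listing is not shown.
    z = "%s%s" % (x, y)
    return z

def concat3():
    # Implicit positions.
    z = "{}{}".format(x, y)
    return z

def concat4():
    # Explicit positions.
    z = "{0}{1}".format(x, y)
    return z

def time_all(number=100_000):
    """Return {function name: average seconds per call} over `number` calls."""
    results = {}
    for func in (concat1, concat2, concat3, concat4):
        total = timeit.timeit(func, number=number)
        results[func.__name__] = total / number
    return results

if __name__ == "__main__":
    for name, seconds in time_all().items():
        print(f"{name}: {seconds * 1e6:.3f} µs/loop")
```

In IPython, the equivalent would be to run %timeit concat1() and so on, one function at a time. The absolute numbers will depend on the machine and the Python version.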

From this benchmark, we can see that concat1, which uses +, is significantly faster than any of the others. Which is a bit sad, given how much I love to use str.format — but it also means that if I’m doing tons of string processing, I should stick to +, which might have less power, but is far faster.

The thing is, the above benchmark might be a bit problematic, because we’re using short strings. Very short strings in Python are “interned,” meaning that they are defined once and then kept in a table so that they need not be allocated and created again. After all, since strings are immutable, why would we create “abc” more than once? We can just reference the first “abc” that we created.
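Which strings get interned automatically is an implementation detail of the interpreter, so it is hard to demonstrate portably. What can be shown reliably is the interning table itself, through sys.intern. The following is a small Python 3 sketch; the variable names are just for illustration.

```python
import sys

# Build two equal strings at runtime, so neither is a compile-time constant.
a = ''.join(['ab', 'c'])
b = ''.join(['a', 'bc'])

# Equal values are guaranteed; whether a and b are the same object is not.
assert a == b

# sys.intern() returns the canonical copy from the interning table,
# so interning two equal strings always yields one shared object.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)  # True
```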

This might mess up our benchmark a bit. And besides, it’s good to check with something larger. Fortunately, we used global variables — so by changing those global variables’ definitions, we can run our benchmark and be sure that no interning is taking place:

x = 'abc' * 10000
y = 'def' * 10000

Now, when we benchmark our functions again, here’s what we get:

concat1: 2.64 µs/loop
concat2: 3.09 µs/loop
concat3: 3.33 µs/loop
concat4: 3.48 µs/loop

Each loop took a lot longer — but we see that our + operator is still the fastest. The difference isn’t as great, but it’s still pretty obvious and significant.

What about if we no longer use global variables, and if we allocate the strings within our function? Will that make a difference? Almost certainly not, but it’s worth a quick investigation:
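The listing for this variant is not reproduced here either. A minimal sketch of what such functions might look like, assuming the same four techniques and hypothetical names with a _local suffix:

```python
import timeit

def concat1_local():
    x = 'abc' * 10000
    y = 'def' * 10000
    return x + y

def concat2_local():
    x = 'abc' * 10000
    y = 'def' * 10000
    return "%s%s" % (x, y)

def concat3_local():
    x = 'abc' * 10000
    y = 'def' * 10000
    return "{}{}".format(x, y)

def concat4_local():
    x = 'abc' * 10000
    y = 'def' * 10000
    return "{0}{1}".format(x, y)

if __name__ == "__main__":
    number = 10_000
    for func in (concat1_local, concat2_local, concat3_local, concat4_local):
        seconds = timeit.timeit(func, number=number) / number
        print(f"{func.__name__}: {seconds * 1e6:.3f} µs/loop")
```

Note that each call now also builds the two strings, so the timing includes the cost of the two multiplications as well as the concatenation.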

Once again, we see that + is the big winner here, but (again) by less of a margin than was the case with the short strings. str.format is clearly slower. And we can see that in all of these tests, the difference between “{0}{1}” and “{}{}” in str.format is basically zero.

Upon reflection, this shouldn’t be a surprise. After all, + is a pretty simple operator, whereas % and str.format do much more. Moreover, str.format is a method, which means that it’ll have greater overhead.

Now, there are a few more tests that I could have run — for example, with more than two strings. But I do think that this demonstrates to at least some degree that + is the fastest way to achieve concatenation in Python. Moreover, it shows that we can do simple benchmarking quickly and easily, conducting experiments that help us to understand which is the best way to do something in Python.

I’m not quite sure how to respond to this comment, other than to say that I’ve been using Python for 20 years, teaching it for about 10, and writing about it for nearly that long. I make tons of mistakes, and very often learn from the people with whom I interact, or who are nice enough to point out the flaws in my thinking. I’d be delighted to get such feedback from you, too.

Wow, twice as fast as +? Egad! I thought about testing that as well, but decided against it because it didn’t involve variables.

But wow, maybe I should have.

I also often tell people not to use the implicit concatenation in their programs, because it’s hard to read such programs — and because I’m not sure how useful it really is. (Maybe there’s an example of where it’s a great idea?)
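For readers who haven’t met it, implicit concatenation means writing string literals next to each other with nothing between them. A short Python 3 sketch (the query text and variable names are made up for illustration):

```python
# Adjacent string literals are joined into a single string.
s = 'abc' 'def'
assert s == 'abcdef'

# A common use: splitting one long literal across lines inside parentheses.
query = ('SELECT name '
         'FROM users '
         'WHERE id = 1')

# It only works for literals. Two adjacent names are a syntax error,
# which is why it can't be benchmarked with variables like x and y.
try:
    compile("x y", "<example>", "eval")
except SyntaxError:
    works_with_variables = False
else:
    works_with_variables = True

print(query)
print(works_with_variables)  # False
```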


David Hancock, a couple of years ago:

I agree with your stance, just striving for completeness of the test. There’s even a PEP (rejected) to remove the implicit concatenation: http://legacy.python.org/dev/peps/pep-3126/. The only use cases that made sense at all were commenting REs (but verbose mode is better) and readable SQL strings (triple-quoting is nearly as good).

The ''.join([x, y]) approach in another comment seems nearly tied for second place with concat2.
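The timings that accompanied this comment are not included in this copy. For anyone who wants to try it, a str.join variant might be sketched as follows. The name concat5 is hypothetical, and note that str.join takes a single iterable, so the two strings have to be wrapped in a list or tuple.

```python
import timeit

x = 'abc' * 10000
y = 'def' * 10000

def concat1():
    z = x + y
    return z

def concat5():
    # Hypothetical fifth variant: join a two-element list.
    z = ''.join([x, y])
    return z

if __name__ == "__main__":
    number = 10_000
    for func in (concat1, concat5):
        seconds = timeit.timeit(func, number=number) / number
        print(f"{func.__name__}: {seconds * 1e6:.3f} µs/loop")
```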


David Hancock, a couple of years ago:

Unrelated note: The math captcha has two minor annoyances: (1) Every time I’ve composed a reply, it’s taken a couple of minutes (I didn’t use timeit() 😉), and the math captcha expired. (2) Sometimes the appearance of one of the numbers (spelled out) makes it look like the captcha wants a spelled-out number. This doesn’t work.

Your tests are kind of weird. I see several funny things with your first approach. First, you’re not only doing your operations (+, %, format(), etc), you’re also calling functions, so you can’t really know how much of the measured time corresponds to the operation and how much to the function call. True, the overhead is constant for all the tests, but then the proportion between the different speeds is off. Second, and this can be a question of personal taste, you’re using global variables instead of parameters. And you don’t explain which version of python/ipython you’re using.

And then you ask: “What about if we no longer use global variables, and if we allocate the strings within our function?”. This really makes no sense, because you’re adding the execution of the str() * int() operator to all the functions. Now you’re measuring something completely different.

I must admit that I’ve always (for reasons I’ve never really thought about) used %timeit with functions. It never occurred to me to just time plain ol’ code, in part because (as you indicated) the function-call overhead would be uniform. But your point about the function-call overhead being so much greater than the actual operation is a very good one.
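One way to get a feel for how much of the measurement is function-call overhead is to time the bare expression, the wrapped version, and an empty function side by side. This is only a sketch, using the standard-library timeit module in Python 3.5 or later (for the globals argument); noop and the names dictionary are hypothetical helpers.

```python
import timeit

x = 'abc' * 10000
y = 'def' * 10000

def concat1():
    return x + y

def noop():
    # Does nothing, so timing it approximates the cost of a call alone.
    pass

number = 20_000
names = {'x': x, 'y': y, 'concat1': concat1, 'noop': noop}

# The bare expression, with no function call around it.
bare = timeit.timeit('x + y', globals=names, number=number)
# The same operation wrapped in a function, as in the post.
wrapped = timeit.timeit('concat1()', globals=names, number=number)
# An empty function, as a rough estimate of call overhead.
call_only = timeit.timeit('noop()', globals=names, number=number)

for label, total in (('x + y', bare), ('concat1()', wrapped), ('noop()', call_only)):
    print(f"{label}: {total / number * 1e6:.3f} µs/loop")
```

With very short strings the call overhead can be a large share of the total; with long strings it matters much less.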

I was using Python 2 for my tests; I should have used Python 3, and clearly should have stated what version I was using. Moreover, a comparison between versions would seem appropriate.

I’m also rather surprised that the numbered use of str.format is the fastest. I’m not surprised that it is faster than {} {}; my guess is that if you’re explicit, then Python doesn’t need to do as much work.

In short, I greatly appreciate your comment. I knew that I’d mess up somewhere with the benchmarking, and am glad that you took the time to indicate where and how I made some mistakes.

You’re measuring the fastest way to concatenate two long strings. That particular example isn’t especially indicative of the fastest way to concatenate many strings, which is probably the actual information your student was seeking. In my experience, one usually needs to concatenate a small number of strings when handling user input or output, which tends to be rare overall. On the other hand, concatenating many strings is common when processing significant quantities of data from external sources, an often time-consuming task.

And when it comes to joining large numbers of strings, there’s a clear winner, and it’s not ‘+’:
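The commenter’s listing is not reproduced in this copy, and the text does not name the winning method. A plausible comparison, assuming the method meant is str.join, could be sketched like this in Python 3. The function names and the sample data are made up, and the relative timings can vary by interpreter and version.

```python
import timeit

# Sample data: ten thousand short strings.
pieces = [str(i) for i in range(10_000)]

def with_plus(parts):
    # Repeated concatenation, one piece at a time.
    result = ''
    for part in parts:
        result += part
    return result

def with_join(parts):
    # A single call that builds the result from all pieces.
    return ''.join(parts)

if __name__ == "__main__":
    number = 100
    for func in (with_plus, with_join):
        seconds = timeit.timeit(lambda: func(pieces), number=number) / number
        print(f"{func.__name__}: {seconds * 1e6:.1f} µs/loop")
```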

The choice of a string concatenation method (or anything similar) should never be based on benchmarks:

1) When the interpreter implementation changes, all your results become invalid.

2) An ugly, non-idiomatic way of doing something hurts your project’s readability. The more such «optimizations» you add, the worse the readability gets. Less readability means more bugs and more time spent on modifications.

3) In real life you’ll never have problems because you didn’t use the fastest way to concatenate strings. Run a profiler on any non-synthetic project and you’ll see all sorts of things: I/O, bad DB architecture, O(n^2) complexity. Anything but string concatenation. String concatenation is just not a place where a real bottleneck can be.

None of the above is something I invented myself: it was all written years ago in a wonderful book I’m sure you know, McConnell’s «Code Complete» (see chapter 25).

Even if you agree with the points above, I think the article should make it absolutely clear that there is only one correct way to concatenate strings: the idiomatic, beautiful way («+» in our case). All optimization comes later, with a profiler, and only if you really need it.
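For anyone who hasn’t used a profiler in Python, here is a minimal sketch with the standard-library cProfile and pstats modules. The build_report function is a made-up workload and profile is a hypothetical helper; neither comes from the post.

```python
import cProfile
import io
import pstats

def build_report(n=2000):
    # Made-up workload that happens to concatenate strings.
    lines = []
    for i in range(n):
        lines.append('row ' + str(i) + ': ' + str(i * i))
    return '\n'.join(lines)

def profile(func, *args, **kwargs):
    """Run func under cProfile; return (result, text of the top entries)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative').print_stats(5)
    return result, stream.getvalue()

if __name__ == "__main__":
    _, text = profile(build_report)
    print(text)
```

The output lists where the time actually went, which is the evidence this comment suggests gathering before optimizing anything.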

Thanks for reading.
Sorry if I seem rough; I didn’t mean to be (English is not my native language).

All of your points are more than reasonable and welcome, except for the first — I wasn’t aiming to do any optimization. I was just curious to know which concatenation method was the fastest. When it comes to my own code, clarity is the most important factor, and I tell this to my students all of the time, in no small part because of the points you made here.

I actually like to use str.format, much more than +, which I find fairly ugly. That’s probably why my student was curious about the relative speeds; I tend to use str.format extensively in my classes, and she was wondering if, given a large program with lots of concatenation, it would make a difference.

Every time I try to do benchmarking, I’m humbled anew. You would think that I would have learned my lesson by now!