Post navigation

String concatenation behind the scenes, part one

The first algorithm I ever worked on in the C# compiler was the optimizer that handles string concatenations.(Unfortunately I did not manage to port these optimizations to the Roslyn codebase before I left; hopefully someone will get to that!) It’s all pretty straightforward, but there are a few clever bits that I thought I might discuss today.

First off, let’s consider just straight up concatenation of string variables; in fact, let’s assume that all variables are strings for the purposes of this article. If you have:

r = a + b;

then clearly that is going to be generated as

r = String.Concat(a, b);

But what about

r = a + b + c;

? String addition, like all addition in C#, is left associative, so this is the same as

r = (a + b) + c;

So you might think that would be generated as

r = String.Concat(String.Concat(a, b), c);

but the compiler can be more efficient than that because there is an overload of String.Concat which takes three operands and produces the same result:

r = String.Concat(a, b, c);

Ditto for four operands. However when we get to five or more operands the compiler switches to a different overload:

r = a + b + c + d + e;

is generated as

r = String.Concat(new String[5] { a, b, c, d, e });

It’s a bit unfortunate that you have to take on the cost of allocating an array, but there it is.

That’s all very straightforward, but there are more optimizations that the C# compiler performs. The main optimization depends upon the fact that string concatenation is fully associative. (And that String.Concat has no observable side effects.) That is, the parentheses don’t matter. You might think that an oddity like

r = a + (b + c);

would have to be generated as

r = String.Concat(a, String.Concat(b, c));

but of course we can observe that since string concatenation is associative, this is no different than

r = (a + b) + c;

which we already know how to generate efficiently. The C# compiler’s string optimizer automatically rewrites nested string concatenations into the left-parenthesized form no matter how you parenthesize them. Except when it doesn’t, which is our next optimization. Let’s now consider the impact of constants:

r = "A" + "B" + c;

The compiler will parse this as:

r = ("A" + "B") + c;

and of course that first term is a compile-time constant according to the spec. The compiler automatically computes values of compile-time constants, so this is the same as:

r = String.Concat("AB", c);

But what about this?

r = a + "B" + "C";

There are two compile-time constants, but this is parsed as

r = (a + "B") + "C";

Since both addition operators have a non-constant element on their left side, you might think this would have to be generated as

r = String.Concat(a, "B", "C");

But again associativity comes to the rescue. The optimizer notices that the original expression is the same as

r = a + ("B" + "C");

and generates

r = String.Concat(a, "BC");

And it will even do the right consolidation of constants “in the middle”:

r = a + "B" + "C" + d;

is generated as though you’d said:

r = (a + ("B" + "C")) + d;

to produce

r = String.Concat(a, "BC", d);

The C# compiler also knows that empty string constants are the identity elements of concatenation; they are eliminated. If you say

r = a + "" + b;

Then the code generated is not

r = String.Concat(a, "", b);

but rather simply

r = String.Concat(a, b);

Null string constants are likewise identity elements:

const string NullString = null;
r = a + NullString + b;

is similarly generated as

r = String.Concat(a, b);

OK I must admit I just made a bald-faced lie there; did you catch it?

The null and empty strings are not identities of concatenation. An identity of an operator has the property that it leaves the other operand unchanged; zero is the identity of addition of numbers because any number plus zero gives you the original number. But there is one possible operand that is not preserved by addition with an empty or null string: the null string! NullString + NullString and NullString + "" both result in the empty string, not the null string, so neither is an identity.

You might think then that

r = a + NullString;

must be generated as

r = String.Concat(a, NullString);

because the compiler needs to ensure that if a is null then the result is the empty string. In fact the C# compiler anticipates what String.Concat will do and simply inlines it. This is generated the same as:

Not quite. It is equivalent to x == null ? "" : x.ToString(). I didn’t mention that corner case but the optimization there is that it is reduced to String.Concat(x), using this overload, which is a convenient method for this case.

Why would it not be CLS compliant? My understanding of CLS compliance is that a CLS-compliant assembly must be for all practical purposes fully usable by any language that implements the minimum requirements. Abstract or virtual methods should not expose non-CLS types, and functionality exposed by methods or overloads that include non-CLS types should be duplicated by methods or overload which do not. I don’t think there’s any prohibition against a CLS-compliant class using non-CLS types, or calling overloads of outside code that use non-CLS types. If one will be linking with a framework which is guarantee to include the __arglist overload, and if it’s more performant than the Object[] one, why not use it?

I’m pretty sure somebody once wrote:
“The sole by-design purpose of __arglist is so that you can safely do interop calls to and from libraries written in C++ that use C-style var args. Do not use it for anything but that.”
(See http://stackoverflow.com/q/910585/124386 – first comment on the question.)

Does this mean that it is OK to have things like “r = a + b;” in your code? I’ve once read that you should replace it with “r = String.Concat(a, b);” because that’s (marginally) more performant. But if the compiler is clever enough to optimize it, I might as well keep using the + operator, right?

Readability is more important than pre-optimizations.
“String.Concat” clutters your codebase for no valuable gain.
Let the compiler optimize your code. Refactor to less readable code only if you need to.

The + operator has to be implemented somehow, and apparently it is implemented in terms of string.Concat. So I’d say replacing it with String.Concat is not useful. However, in case you are building a larger string with many concatenations, you should probably use a StringBuilder.

The only reason (in both Java and C#) to use a StringBuilder directly is if you’re using loops, because last time I checked at least the Java JIT wasn’t clever enough to generate optimal code.

That would probably be a rather useful optimization though (and would allow us programmers to make the code clearer), but I’m sure it’s not as simple as I’m thinking right now, certainly some evil edge case in there.

This optimization is detectable by checking String.IsInterned. The documentation for interning says that the intern pool consists of “a single instance of each unique literal string constant declared in a program, as well as any unique instance of String you add programmatically.” In the case of “A” + “B”, the string that is interned is “AB”, which appears to contradict the documentation as stated, since the literal string constant “AB” does not appear in the program. It was introduced by the optimizer. I assume this is a documentation error, but it might surprise people who are using IsInterned for business logic.

I also have a pathological example where a string literal never enters the intern table (nor is any compiler-optimized concatenation of that string with any other string), which this margin is too narrow to contain.

Back in 2008 I had to write an application that did a lot of string manipulations and I ran into a lot of performance issues. When I used the Visual Studio Profiler, it told me that the chief offender was the string concatenation and the advice I had googled up was to use StringBuilder instead of string1 + string2. I did just that and the performance improved by 50%. Since then I almost always use StringBuilder when I need to concatenate more than 2 strings. I wonder, is there anything special about the way StringBuilder does the concatenation? The article that recommended it to me said that it does not create temporary copies (whereas the ‘normal’ concatenation does). So this is my question: is there a difference between the ‘normal’ concatenation and the way StringBuilder.Append() works? Thanks in advance.

Strings are immutable, so normal string concatenation creates a *new* string. At least naively, this means string concatenation has to copy the contents of both source strings to a completely new string object (string could be using some fancy data structure internally to avoid this, but my guess is that it does not). At the least it means that a new string object is allocated.
StringBuilder is mutable, it contains a buffer that can be changed / appended to directly. The Append() method does exactly that, it changes the StringBuilder by appending something to its internal buffer. The original contents of the StringBuilder don’t need to be copied, and there is no extra object creation.
So to summarize, a = b + c involves three String objects. The strings referred to by b and c remain unchanged, and a *new* String is assigned to a.
sb.Append(c) only involves one StringBuilder and one String, and the StringBuilder itself is changed.
Usually the simplicity of working with plain strings outweighs the performance benefit of StringBuilder, but if you concatenate long strings together or do many concatenations (in particular in loops) the advantage of StringBuilder can become significant.

If code says `someString = someString+”1″;` [both variables of type `string`], no other variable anywhere in the entire universe holds a reference to the string object referred to by variable `someString`, and the amount of space allocated for that object is large enough to accommodate another character, then having the code add another character to the the string object referred to by `someString` would be equivalent to generating a new string object containing the characters in `someString` plus the digit ‘1’. If, however, there exists somewhere in the universe some other reference to that same string object, modifying it cause code which expected to hold a reference to the unmodified sequence of characters to instead hold a reference to the modified sequence. Neither the compiler nor runtime have facilities for keeping track of whether a variable holds the only extant reference to an object, and so both are generally required to assume that such references might exist.

When code says `myStringBuilder.Append(“1”);` the `StringBuilder` knows that it holds the only reference to its buffer anywhere in the universe. Since nobody who holds a reference to the `StringBuilder` will be expecting to hold a reference to its unmodified value, it’s free to modify its buffer as it sees fit without having to create a new object.

Although there is some “magic” within the StringBuilder class which allows better performance than could be achieved without it, such magic is a relatively small part of why code that uses it can perform much better than code which builds strings by repeated concatenation.

CPython uses reference counting, so it can know that there is only a single reference to a given string. Thus, it can make the optimization of not having to copy a string if there is enough room after it to do the concatenation. Presumably other runtimes with immutable strings that use reference counting perform a similar optimization.

could be interesting as well, even if I they do assume compiler knowledge of the framework (but then again, so do the string concatenation optimizations).

I often have to review code written by others, and have written custom Code Analysis rules for these (and many other more interesting issues). Obviously, I am now experimenting with and counting on Roslyn CodeIssueProviders for this sort of thing.

In fact, it could be argued that things like

r = a + null;

might be better dealt with in a CodeIssueProvider, than in the compiler itself.

Wow. That awakward day when you learn that strings in C# don’t make a monoid… or do they? Is “null” even a string? If not, then “pure” strings do form the free monoid (over the set of Unicode characters, sans issues with non-BMP stuff). However, “NullString + NullString != NullString” still breaks down Nullable<T> monad. I can understand why it was made this way, however — I still remember that one program where I concatenated like 30 strings together in SQL, and was extremely frustrated that I when I got a sudden NULL as a result, I had to look thorugh all 30 operands to find that one unlucky NULL-string. Wrapping every string in COALESCE(s, “”) solved the problem.