Surely if A equals B, and B equals C, then A equals C; that's the transitive property of equality. It appears to have been thoroughly violated here.

Well, first off, though the transitive property is desirable, this is just one of many situations in which equality is intransitive in C#. You shouldn't rely upon transitivity in general, though of course there are many specific cases where it is valid. As an exercise, you might want to see how many other intransitivities you can come up with. Post 'em in the comments; I'd love to see what obscure ones you can come up with. (Incidentally, one of the interview questions I got when applying for this team was to invent a performant algorithm for determining intransitivities in a simplified version of the 'better method' algorithm.)

Second, what's happening here is we're mixing two different kinds of equality that just happen to use the same operator syntax. We're mixing reference equality with value equality. Objects are compared by reference; in the first and third comparisons we are testing if the two object references both refer to exactly the same object. In the second comparison we are checking to see if the two strings have the same content, regardless of whether they are the same object or not. In fact, the compiler warns you about this situation; this should produce a "possible unintended reference comparison" warning.

That might need a bit more explanation. In .NET you can have two strings that have identical content but are different objects. When you compare those strings as strings, they're equal, but when you compare them as objects, they're not.
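Here is a minimal sketch of that distinction. The class and variable names are invented for illustration, and the strings are built at runtime with the string(char, int) constructor so that they are certain to be two distinct, non-empty objects.

```csharp
using System;

static class EqualityKinds
{
    static void Main()
    {
        // Two strings built at runtime: same content, different objects.
        string s1 = new string('x', 3);
        string s2 = new string('x', 3);

        object o1 = s1;
        object o2 = s2;

        // Both operands are typed as string, so the compiler binds to
        // string's overloaded ==, which compares content.
        Console.WriteLine(s1 == s2);                       // True

        // Both operands are typed as object, so == compares references.
        Console.WriteLine(o1 == o2);                       // False

        // Mixed operands are also compared by reference. The explicit cast
        // avoids the "possible unintended reference comparison" warning
        // that a bare `o1 == s2` would produce.
        Console.WriteLine(o1 == (object)s2);               // False

        // These two calls state the intent explicitly.
        Console.WriteLine(object.ReferenceEquals(o1, s2)); // False
        Console.WriteLine(o1.Equals(s2));                  // True
    }
}
```

The choice between the two kinds of equality is made at compile time from the static types of the operands, not from the runtime type of the objects.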

That explains why the second comparison is true -- it's a value comparison -- and why the third comparison is false -- it's a reference comparison. But it doesn't explain why the first and third comparisons are inconsistent with each other.

This is the result of a small optimization. If you have two identical string literals in one compilation unit then the code we generate ensures that only one string object is created by the CLR for all instances of that literal within the assembly. This optimization is called "string interning".

String.Empty is not a constant; it's a read-only field in another assembly. Therefore it is not interned with the empty string in your assembly; those are two different objects.

This explains why the first comparison is true: the two literals in fact get turned into the same string object. And it explains why the third comparison is false: the literal and the computed value are turned into different objects.
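The literal-versus-computed distinction can be sketched as follows. This is an illustration under assumptions, not a specification: the names are made up, and the first two results are what I would expect on typical runtimes rather than something this snippet proves, so they carry no expected-output comment.

```csharp
using System;

static class LiteralInterning
{
    static void Main()
    {
        object a = "hello";
        object b = "hello";       // the same literal, in the same assembly
        object c = "hel" + "lo";  // a constant expression, folded at compile time
        object d = string.Concat("hel", new string('l', 1), "o"); // built at runtime

        // Identical literals and folded constants are expected to share
        // one string object.
        Console.WriteLine(object.ReferenceEquals(a, b));
        Console.WriteLine(object.ReferenceEquals(a, c));

        // Concat allocates a fresh string, so this is a different object...
        Console.WriteLine(object.ReferenceEquals(a, d)); // False

        // ...even though its content is identical.
        Console.WriteLine(a.Equals(d));                  // True
    }
}
```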

Knowing that, you can now make an educated guess as to why we have this bizarre behaviour:

Some versions of the .NET runtime automatically intern the empty string at runtime, some do not!

But why, you might ask, do we not perform this interning optimization at runtime on every string? Why not aggressively turn all value-equal strings into reference-equal strings? Surely it is wasteful to have two identical strings around when you could use half as much memory.

The answer is that the TANSTAAFL Principle applies here, bigtime. That is, There Ain't No Such Thing As A Free Lunch. Interning has two positive effects: it decreases memory consumption and decreases time required to compare two strings. (Because if all strings are interned at runtime then all string comparisons can be cheap reference comparisons.) But those positive effects have a cost: allocating a new string now requires that you do a search of all string objects in memory to see if you have one that matches already. In our existing optimization, the cost is small; we can know at compile time what string literals are in a given assembly and which are identical. With the proposed optimization, that cost is imposed at runtime, and it could be a very large fraction of the time spent allocating strings.

In order to keep the time cost down, you'd have to build a hash table of all strings in memory. That means either computing the hashes frequently, which is itself expensive in time, or storing the hashes somewhere. If we do the latter then suddenly we are increasing the memory burden for strings that are not duplicated. That is, our optimization makes the normal scenario -- the vast majority of pairs of strings are not equal to each other -- take up more memory, so that a rare scenario saves on memory. That seems like a bad bargain; you usually want to optimize for the likely case.

There are also serious lifetime problems with interned strings. When can they be safely garbage collected? What if a new copy of the string is created while the old one is being collected on another thread? The safest thing to do is to make interned strings immortal, which looks like a memory leak. Memory leaks are bad for performance, particularly when the optimization you're doing is an attempt to save memory. TANSTAAFL!
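To make those costs concrete, here is a sketch of a hand-rolled pool. StringPool is a hypothetical class written for this illustration; it is not how the CLR or our compiler implements interning, and it ignores thread safety entirely.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interning pool, for illustration only.
sealed class StringPool
{
    private readonly Dictionary<string, string> pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Returns the canonical instance for the given content.
    public string Intern(string s)
    {
        string existing;
        // Time cost: every lookup hashes s and may compare its characters.
        if (pool.TryGetValue(s, out existing))
            return existing;

        // Lifetime cost: the pool now keeps s alive for as long as the
        // pool itself is reachable.
        pool.Add(s, s);
        return s;
    }

    public int Count { get { return pool.Count; } }
}

static class StringPoolDemo
{
    static void Main()
    {
        var pool = new StringPool();
        string a = pool.Intern(new string('x', 3));
        string b = pool.Intern(new string('x', 3));

        Console.WriteLine(object.ReferenceEquals(a, b)); // True
        Console.WriteLine(pool.Count);                   // 1
    }
}
```

Every string that goes through Intern pays for a hash and a lookup, whether or not a duplicate exists, and nothing is ever removed from the dictionary.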

In short, it is in the general case not worth it to intern all strings. However, it might be worth it in some specific cases. For example, if you were building a compiler in C#, odds are good that you are going to be producing a lot of strings that are the same at runtime. Our C# compiler is written in C++, in which we have written our own custom string interning layer so that we can do cheap reference comparisons on all strings in your program. Odds are good that "int" is going to appear tens, hundreds or thousands of times in a given program; it seems silly to allocate the same string over and over again. If you were writing a compiler in C#, or had some other application in which you felt that it was worth your while to ensure that thousands of identical strings do not consume lots of memory, you can force the runtime to intern your string with the String.Intern method.
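A minimal sketch of String.Intern in use. The class and variable names are invented for the example; String.Intern and String.IsInterned are the real framework methods.

```csharp
using System;

static class InternDemo
{
    static void Main()
    {
        // Two distinct objects with the same content, built at runtime.
        string a = new string('i', 1) + "nt";
        string b = new string('i', 1) + "nt";
        Console.WriteLine(object.ReferenceEquals(a, b));   // False

        // String.Intern returns the pool's canonical reference for
        // that content, adding the string to the pool if necessary.
        string ia = string.Intern(a);
        string ib = string.Intern(b);
        Console.WriteLine(object.ReferenceEquals(ia, ib)); // True

        // String.IsInterned returns the pooled reference, or null
        // if no string with that content is in the pool.
        Console.WriteLine(string.IsInterned(b) != null);   // True
    }
}
```

Once every string of interest has gone through String.Intern, comparing any two of them can be a cheap reference comparison.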

Conversely, if you hate interning with an unreasoning passion, you can tell the runtime that it need not intern the string literals in an assembly with the CompilationRelaxations attribute.
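For reference, this is roughly what applying that attribute looks like. The attribute and the NoStringInterning value live in System.Runtime.CompilerServices; the class around them is just filler so that the fragment compiles. As I understand it, the attribute marks the assembly as not requiring literal interning, so the runtime is permitted, not obliged, to skip it. For that reason the sketch makes no claim about what it prints.

```csharp
using System;
using System.Runtime.CompilerServices;

// Marks this assembly as not requiring string literal interning.
[assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)]

static class RelaxationsDemo
{
    static void Main()
    {
        object a = "same";
        object b = "same";

        // The result depends on the runtime; the attribute only removes
        // the requirement that these be the same object.
        Console.WriteLine(object.ReferenceEquals(a, b));
    }
}
```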

Anyway, to come back to the question of transitivity: object reference equality actually is transitive. It's also symmetric (A==B implies B==A) and reflexive (A==A), so it is an equivalence relation. Similarly, string value equality is transitive, symmetric and reflexive, since it uses a straight "character by character" ordinal comparison. But when you mix the two, then equality is no longer transitive. That's weird, but hopefully now understandable.

I wonder if an empty string is special-cased in some way. No matter what I’ve tried, it looks like an empty string will always refer to the same instance as String.Empty. I got the expected result using other non-interned strings though:

Hmm…I’m surprised that the String class doesn’t cache the hash value already. Granted, I never gave it much thought. But Strings are a common object to use as a key in a hashed structure; I’d think that the nominal overhead would be worthwhile for reasons other than interning, and thus interning could simply take advantage of that.

The question about object lifetime seems less than a "slam dunk" too. That is, yes…the simplest, safest implementation would simply lead to a huge increase in memory usage. But is that really the _only_ implementation?

The time cost problem seems like a much more important point than these other two.

I think in the end, it’s not so much that there’s a clear argument against interning every string at run-time, but simply that there is a vague, general argument against it and no terribly compelling need in favor of it. That is, it _could_ work given enough effort in the implementation, but in the classic cost/benefit analysis, cost is very high and benefit is very low.

In Java, comparing two String objects using "==" always results in a reference comparison, so string content is always compared using String.equals(). The same concept of literal pools applies to Java, though.

Sample this:

String str1 = "xyz";
Object obj1 = "xyz";
String str2 = new String("xyz");
System.out.println(str1 == obj1); // true
System.out.println(str1 == str2); // false
System.out.println(str2 == obj1); // false
System.out.println(str1.equals(obj1)); // true
System.out.println(str2.equals(obj1)); // true
System.out.println(((String) obj1).equals(str1)); // true

I always thought the same was true for C#. Interesting, now I know… Thanks! 🙂

@Franklin, if String.Empty were a constant (IL "literal"), its value would be inserted into IL at compile time – so it wouldn’t be any different from just using "". In particular, it would only be interned once per assembly. But since it’s actually a static readonly field, there’s just one single instance shared between all code using String.Empty. I’m not sure if this has any distinct advantages, or if it is even the rationale for making it non-constant, but I can’t think of any other points of difference.

Excellent post. Could you please clarify the following bit for me: "only one string object is created by the CLR for all instances of that literal within the assembly." Are you saying that if assembly A and assembly B both contain the same literal string there will be two copies of this in memory or am I reading this backwards? Because as far as I have been able to observe, that is not the case.

A slight modification of the code at the beginning of the post provides yet another illustration of the difference between comparison by reference and comparison by value:

object obj = "Int32";
StringBuilder sb = new StringBuilder("Int32");
string str1 = sb.ToString();
string str2 = typeof(int).Name;
Console.WriteLine(obj == str1); // False, this time!!!
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // false !?

Well, it’s self-explanatory, pretty much: the call to StringBuilder.ToString() defeats the interning, somehow, so that the two "Int32" do not end up being the same object (I’ve used the VS 2008 SP1, Standard Edition, on 64-bit Windows 7 Ultimate RTM: it may be different on other .NET versions, of course).

> Are you saying that if assembly A and assembly B both contain the same literal string there will be two copies of this in memory or am I reading this backwards? Because as far as I have been able to observe, that is not the case.

You’re right, and I’m wrong (and I have no idea where I got this notion from). In fact, it’s quite obvious now that I think of it – there’s only one string pool, so assemblies don’t matter.

Which, obviously, means that my guess at the rationale of String.Empty is entirely wrong, as well. Back to square one.

@Denis

> the call to StringBuilder.ToString() defeats the interning, somehow

That one is actually pretty straightforward (and Eric has already explained it in the post): only literals (including those produced by constant expressions at compile-time, like "a"+"b") are interned by default. The return value of StringBuilder.ToString() is not a literal.

First off, reference equality doesn’t come into it; you have no reference types at all in this program fragment.

NaN means “not a number”, and NaNs are special. In particular, the floating point standard requires that NaN == NaN be false. Basically, NaN means “the result is unknown or nonsensical.” You have two results which are unknown or nonsensical. Let’s suppose the two results are the total sales for October 10th, which are unknown, and the total sales for February 31st, which are nonsensical. You compare them for equality. Does it make any sense to say “why yes, those two figures are equal!” ? Of course not. So NaNs never equal each other.
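A short sketch of that behaviour, using only double.NaN and the framework's own comparison methods. The variable names simply echo the sales example and are otherwise arbitrary.

```csharp
using System;

static class NaNDemo
{
    static void Main()
    {
        double unknown = double.NaN;
        double nonsensical = 0.0 / 0.0;  // floating point 0/0 is also NaN
        double copy = unknown;

        // The == operator follows the floating point standard:
        // NaN never equals anything, not even another NaN.
        Console.WriteLine(unknown == nonsensical);      // False
        Console.WriteLine(unknown == copy);             // False
        Console.WriteLine(unknown != copy);             // True

        // Test for NaN with IsNaN rather than with ==.
        Console.WriteLine(double.IsNaN(nonsensical));   // True

        // Double.Equals, unlike ==, treats two NaNs as equal.
        Console.WriteLine(unknown.Equals(nonsensical)); // True
    }
}
```

So with NaN the == operator is not even reflexive, and there are no reference types anywhere in sight.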

Note that “null” in VB has this same property; if you compare null to null in VB, you get null, not true or false.

So why isn’t String.Empty a constant ? (I know this was asked before, but it seems the only answer given was later invalidated). I guess since it appears that String.Empty IS interned it probably doesn’t make any difference, but I’m interested in the answer.

"Apart from the costs" I don’t think there is a reason not to write the C# compiler in C#. But that’s like asking, apart from my height and lack of athletic ability, for what other reason can’t I be an All-Star professional basketball player? You have to live in reality. Cost is usually the reason that desirable things don’t get done, in software and the rest of the world.

In fact the C# team has talked about exposing the compiler as a managed service to aid metaprogramming, scripting, and other scenarios. Anders himself spoke about it at PDC last year. So I imagine we might see it happen. But it has to make it to the top of the priority list, past a whole lot of other desirable things (as Eric has often spoken of).

If I run this code in a new console application, I get the behavior you indicate (true/true/false). When I look at the generated assembly in Reflector, however, the CompilationRelaxations attribute is present, with string literal interning disabled [CompilationRelaxations(8)]. So if string interning is disabled, why am I getting the behavior that should only occur if string literal interning is enabled?

As I understand it, references to const fields are replaced with their actual values at compile time. Const fields need to be of a value type, with string being some sort of exception. Does the interning rule kick in when it detects that a constant is a string?