When doing case-insensitive comparisons, is it more efficient to convert the string to upper case or lower case? Does it even matter?

It is suggested in this SO post that C# is more efficient with ToUpper because "Microsoft optimized it that way." But I've also read this argument that converting ToLower vs. ToUpper depends on what your strings contain more of, and that typically strings contain more lower case characters which makes ToLower more efficient.

In particular, I would like to know:

Is there a way to optimize ToUpper or ToLower such that one is faster than the other?

Is it faster to do a case-insensitive comparison between upper or lower case strings, and why?

Are there any programming environments (e.g. C, C#, Python, whatever) where one case is clearly better than the other, and why?

10 Answers

Converting to either upper case or lower case in order to do case-insensitive comparisons is incorrect due to "interesting" features of some cultures, particularly Turkey. Instead, use a StringComparer with the appropriate options.
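To see why the two conversions are not interchangeable, it helps to try a couple of awkward characters. The sketch below is in Python rather than C#, purely for illustration; the helper names are hypothetical, and Python's `str.upper()`, `str.lower()` and `str.casefold()` are not culture-aware, so this does not reproduce .NET's Turkish-culture behaviour. It only shows that "compare after upper" and "compare after lower" can disagree.

```python
def equal_via_upper(a: str, b: str) -> bool:
    """Compare after converting both sides to upper case."""
    return a.upper() == b.upper()


def equal_via_lower(a: str, b: str) -> bool:
    """Compare after converting both sides to lower case."""
    return a.lower() == b.lower()


def equal_via_casefold(a: str, b: str) -> bool:
    """Compare using Unicode case folding, which is meant for caseless matching."""
    return a.casefold() == b.casefold()


# German sharp s, and Turkish dotless i versus ASCII capital I.
pairs = [("straße", "STRASSE"), ("ı", "I")]

# Maps each pair to (via_upper, via_lower, via_casefold).
results = {
    pair: (
        equal_via_upper(*pair),
        equal_via_lower(*pair),
        equal_via_casefold(*pair),
    )
    for pair in pairs
}

for pair, outcome in results.items():
    print(pair, outcome)
```

For both pairs the upper-case and lower-case routes give different answers, which is the practical reason to rely on a dedicated comparer instead of picking a case.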

Yes, StringComparer is great, but the question wasn't answered... In situations where you can't use StringComparer, such as a switch statement against a string, should I ToUpper or ToLower in the switch?
–
joshperry, Feb 22 '09 at 19:51

Use a StringComparer and "if"/"else" instead of using either ToUpper or ToLower.
–
Jon Skeet, Feb 22 '09 at 20:50

Jon, I know that converting to lower case is incorrect, but I had not heard that converting to uppercase is incorrect. Can you offer an example or a reference? The MSDN article you linked to says this: "Comparisons made using OrdinalIgnoreCase are behaviorally the composition of two calls: calling ToUpperInvariant on both string arguments, and doing an Ordinal comparison." In the section titled "Ordinal String Operations", it restates this in code.
–
Neil Whitaker, Mar 18 '11 at 15:59

@Neil: Interesting, I hadn't seen that bit. For an ordinal case-insensitive comparison, I guess that's fair enough. It's got to pick something, after all. For culturally-sensitive case-insensitive comparisons, I think there'd still be room for some odd behaviour. Will point out your comment in the answer...
–
Jon Skeet, Mar 18 '11 at 16:02

@Triynko: I think it's important to concentrate primarily on correctness, with the point that getting the wrong answer fast is usually no better (and is sometimes worse) than getting the wrong answer slowly.
–
Jon Skeet, Sep 15 '11 at 18:19

Based on strings tending to have more lowercase entries, ToLower should theoretically be faster (lots of compares, but few assignments).

In C, or when using individually-accessible elements of each string (such as C strings or the STL's string type in C++), it's actually a byte comparison - so comparing UPPER is no different from lower.

If you were sneaky and loaded your strings into long arrays instead, you'd get a very fast comparison on the whole string because it could compare 4 bytes at a time. However, the load time might make it not worthwhile.
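As a rough illustration of the byte-comparison point, here is a minimal sketch of an ASCII-only case-insensitive comparison that folds each byte on the fly instead of converting either string first. The function names are hypothetical, and it is written in Python only to show the logic; it says nothing about real-world speed in C or C++.

```python
def fold_ascii(byte: int) -> int:
    """Map ASCII 'A'-'Z' to 'a'-'z'; leave every other byte unchanged."""
    return byte | 0x20 if 0x41 <= byte <= 0x5A else byte


def ascii_equal_ignore_case(a: bytes, b: bytes) -> bool:
    """Compare two byte strings, ignoring ASCII letter case only."""
    if len(a) != len(b):
        return False
    return all(fold_ascii(x) == fold_ascii(y) for x, y in zip(a, b))


print(ascii_equal_ignore_case(b"Hello", b"hELLO"))
print(ascii_equal_ignore_case(b"a[", b"a{"))
```

Folding to upper case instead would be the mirror image: test for the range `0x61` to `0x7A` and clear the `0x20` bit. The work per byte is the same either way.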

Why do you need to know which is faster? Unless you're doing a metric buttload of comparisons, one running a couple cycles faster is irrelevant to the speed of overall execution, and sounds like premature optimization :)

To answer the question why I need to know which is faster: I don't need to know, I merely want to know. :) It's simply a case of seeing somebody make a claim (such as "comparing upper case strings is faster!") and wanting to know whether it is really true and/or why they made that claim.
–
Parappa, Oct 24 '08 at 18:06

@bjan The reason is because it's bad not to.
–
Ian Boyd, Feb 7 '13 at 0:27

What group of characters? What does make a round trip even mean?
–
johv, Jul 8 '14 at 14:23

@johv From the link: "To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters." What group of characters? I don't know, but I'm going to guess the lowercase i in Turkish, which becomes İ when uppercased, rather than the I that you're used to. Also, we're used to uppercase I becoming i, but in Turkey it becomes ı.
–
Ian Boyd, Jul 8 '14 at 14:56

@IanBoyd Right, but in Turkish it seems like the mapping would be one-to-one, both for upper and lower case. At first I thought they meant German ß, but for that one the problem is the traditional lack of an uppercase version, right? (And nowadays there is an uppercase version for it in Unicode too, with ẞ vs ß.)
–
johv, Jul 10 '14 at 8:09

Microsoft has optimized ToUpperInvariant(), not ToUpper(). The difference is that the invariant version uses the invariant culture rather than the current thread's culture, so it gives the same result regardless of the user's locale. If you need to do case-insensitive comparisons on strings that may vary in culture, use Invariant; otherwise the performance of invariant conversion shouldn't matter.

I can't say whether ToUpper() or ToLower() is faster though. I've never tried it since I've never had a situation where performance mattered that much.

If you are doing string comparison in C#, it is significantly faster to use .Equals() with a case-insensitive StringComparison option (for example, StringComparison.OrdinalIgnoreCase) instead of converting both strings to upper or lower case. Another big plus for using .Equals() is that no extra memory is allocated for the 2 new upper/lower case strings.

It really shouldn't ever matter. With ASCII characters, it definitely doesn't matter - it's just a few comparisons and a bit flip for either direction. Unicode might be a little more complicated, since there are some characters that change case in weird ways, but there really shouldn't be any difference unless your text is full of those special characters.

Doing it right, there should be a small, insignificant speed advantage if you convert to lower case, but this is, as many have hinted, culture dependent and is not inherent in the function but in the strings you convert (lots of lower case letters means few assignments to memory) -- converting to upper case is faster if you have a string with lots of upper case letters.
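The "few assignments" argument can be made concrete by counting how many characters each conversion would actually have to change. This is a sketch under a strong assumption: an in-place, ASCII-only converter that writes only when a character changes. Many runtimes (including .NET, where strings are immutable) build a new string regardless, in which case the difference may not exist at all. The function name is hypothetical.

```python
def writes_needed(text: str) -> tuple:
    """Return (writes for to-lower, writes for to-upper) for ASCII letters."""
    to_lower = sum(1 for ch in text if "A" <= ch <= "Z")
    to_upper = sum(1 for ch in text if "a" <= ch <= "z")
    return to_lower, to_upper


print(writes_needed("Hello World"))
print(writes_needed("HTTP_PROXY"))
```

For typical mixed-case prose the first number is smaller, and for all-caps identifiers the second is, which is all the claim amounts to.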

It depends.
As stated above, for plain ASCII only, it's identical.
In .NET, read about and use String.Compare; it's correct for the i18n stuff (languages, cultures, and Unicode). If you know anything about the likelihood of the input, use the more common case.

Remember, if you are doing multiple string compares length is an excellent first discriminator.

This is completely wrong - OR'ing with 32 only works for A-Z and characters 64-127; it screws up all other characters. AND'ing with 32 is even more wrong - the result will always be 0 (nul) or 32 (space).
–
Adam Rosenfield, Oct 24 '08 at 18:34
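The objection in this comment is easy to check. The sketch below applies an unconditional OR and AND with 32 to every character; the function names are hypothetical and the code is only meant to show what the bit operations do without a range check.

```python
def or_32(text: str) -> str:
    """Set bit 5 (value 32) on every character, with no range check."""
    return "".join(chr(ord(ch) | 32) for ch in text)


def and_32(text: str) -> str:
    """Keep only bit 5 (value 32) of every character."""
    return "".join(chr(ord(ch) & 32) for ch in text)


print(repr(or_32("ABC")))         # letters are lowercased
print(repr(or_32("\n")))          # a newline (10) turns into '*' (42)
print(repr(and_32("Hello!")))     # only NUL and space characters remain
```

OR with 32 lowercases A-Z but also rewrites any other character whose bit 5 is clear, such as control characters, and AND with 32 can only ever produce 0 or 32. A correct ASCII conversion needs the range test first.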