5.5 String Comparisons and Searches

String comparison
performance is
highly dependent on both the string data and the comparison algorithm
(this is really a truism about collections in general). The methods
that come with the String class have a performance
advantage in being able to directly access the underlying
char collection. So if you need to make
String comparisons, String
methods usually provide better performance than your own methods,
provided that you can make your desired comparison fit in with one of
the String methods.
Another necessary
consideration is whether comparisons are case-sensitive or
-insensitive, and I will consider this in more detail shortly.

To optimize for string comparisons, you need to look at the source of
the comparison methods so you know exactly how they work. As an
example, consider the String.equals(
)
and String.equalsIgnoreCase(
) methods from the Java 2 distribution.

String.equals(Object) runs
in a fairly straightforward way: it first checks for object identity,
then for null, then for String
type, then for same-size strings, and then character by character,
running from the first character to the last. Efficient and complete.

String.equalsIgnoreCase(String) is a little more
complex. It checks for null, and then for strings
being the same size (the String type check is not
needed, since this method accepts only String
objects). Then, using a case-insensitive comparison,
regionMatches( ) is applied. regionMatches(
) runs a character-by-character test from the first
character to the last, converting each character to uppercase before
comparing.

Immediately,
you see that the more differences there are between the two strings,
the faster these methods return. This behavior is common for
collection comparisons, and the order of the comparison is crucial.
In these two cases, the strings are compared starting with the first
character, so the earlier the difference occurs, the faster the
methods return. However, equals( ) returns faster
if the two String objects are identical.
It is unusual to check
Strings by identity, but there are a number of
situations where it is useful (for example, when you are using a set
of canonical Strings; see Chapter 4). Another
example is when an application has enough time during string input to
intern( )[5] the
strings, so that later comparisons by identity are possible.

[5]String.intern( ) returns the
String object that is being stored in the internal
VM string pool. If two Strings are equal, then
their intern( ) results are identical; for
example, if s1.equals(s2) is
true, then s1.intern( ) = = s2.intern(
) is also true.

In any case, equals( ) returns immediately if the
two strings are identical, but equalsIgnoreCase( )
does not even check for identity (which may be reasonable given what
it does). This results in equals( ) running an
order of magnitude faster than equalsIgnoreCase( )
if the two strings are identical; identical strings is the fastest
test case resolvable for equals( ), but the
slowest case for equalsIgnoreCase( ).

On the other hand, if the two strings are different in size,
equalsIgnoreCase( ) has only two tests to make
before it returns, whereas equals( ) makes four
tests before it returns. This can make equalsIgnoreCase(
) run 20% faster than equals( ) for what
may be the most common difference between strings.

There are more differences between these two methods. In almost every
possible case of string data, equals( ) runs
faster (often several times faster) than equalsIgnoreCase(
). However, in a test against the words from a particular
dictionary, I found that over 90% of the words were different in size
from a randomly chosen word. When comparing the performance of these
two methods for a comparison of a randomly chosen word against the
entire dictionary, the total comparison time taken by each of the two
methods was about the same. The many cases in which strings had
different lengths compensated almost exactly for the slower
comparison of equalsIgnoreCase( ) when the strings
were similar or equal. This illustrates how
the data and the algorithm interplay with
each other to affect performance.

Even though String
methods have access to the internal chars, it can
be faster to use your own methods if there are no
String methods appropriate for your test. You can
build methods that are tailored to the data you have. One way to
optimize an equality test is to look for ways to make the strings
identical. An alternative that can actually be better for performance
is to change the search strategy to reduce search time. For example,
a linear search through a large array of Strings
is slower than a binary search through the same size array if the
array is sorted. This, in turn, is slower than a straight access to a
hashed table. Note that when you are able and willing to deploy
changes to JDK classes (e.g., for servlets), you can add methods
directly to the String class. However, altering
JDK classes can lead to maintenance problems.[6]

[6] Several
of my colleagues have emphasized their view that changes to the JDK
sources lead to severe maintenance problems.

When
case-insensitive
searches are required, one standard optimization is to use a second
collection containing all the strings uppercased. This second
collection is used for comparisons, obviating the need to repeatedly
uppercase each character in the search methods. For example, if you
have a hash table containing String keys, you need
to iterate over all the keys to match keys case-insensitively. But,
if you have a second hash table with all the keys uppercased,
retrieving the key simply requires you to uppercase the element being
searched for:

However, note that String.toUppercase(
)
(and String.toLowercase( )) creates a complete
copy of the String object with a new
char array. Unlike String.substring(
)
,
String.toUppercase( ) has a processing time that
is linearly dependent on the size of the string and also creates an
extra object (a new char array). This means that
repeatedly using String.toUppercase( ) (and
String.toLowercase( )) can impose a heavy overhead
on an application. For each particular problem, you need to ensure
that the extra temporary objects created and the extra processing
overhead still provide a performance benefit rather than causing a
new bottleneck in the application.