Motivation

This article examines how the Compact Strings feature, added in JDK 9 by JEP 254, behaves in applications where the Strings are predominantly UTF-16.

Context

As you might know, in JDK 9 the internal representation of the String class changed from a UTF-16 char[] array to a byte[] array plus a coder flag field. The new String class stores characters encoded either as ISO-8859-1/Latin-1 (one byte per character) or as UTF-16 (two bytes per character), and the coder field indicates which encoding is in use.

This new internal String representation (i.e. a byte[] array instead of a char[] array) enables a scheme for compacting Strings at construction time: it tries to use one byte instead of two for ISO-8859-1/Latin-1 Strings, reducing the overall String footprint.

By default, when a new String is created, it first attempts to compress the input char[] to Latin-1 by stripping off the upper byte of each character (i.e. backing each character with a single byte). If that fails, UTF-16 encoding is used, where each char spans two bytes. The code looks like below (snapshot from the java.lang.String class):
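Since the actual JDK source is not reproduced here, the following is a simplified, self-contained sketch of that construction logic. The field and method names deliberately mirror the JDK (value, coder, COMPACT_STRINGS, compress), but this class is an illustrative approximation, not the real java.lang.String code:

```java
public class CompactStringSketch {
    // In the JDK this is implicitly true; -XX:-CompactStrings turns it off.
    static final boolean COMPACT_STRINGS = true;
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte[] value;
    final byte coder;

    CompactStringSketch(char[] chars) {
        if (COMPACT_STRINGS) {
            byte[] compressed = compress(chars);
            if (compressed != null) {       // every char fit into one byte
                this.value = compressed;
                this.coder = LATIN1;
                return;
            }
        }
        this.value = toUtf16Bytes(chars);   // fallback: two bytes per char
        this.coder = UTF16;
    }

    // Returns null if any char needs more than one byte (i.e. > 0xFF),
    // meaning the upper byte cannot simply be stripped off.
    static byte[] compress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c > 0xFF) {
                return null;                // compression fails
            }
            out[i] = (byte) c;
        }
        return out;
    }

    static byte[] toUtf16Bytes(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i]     = (byte) (chars[i] >> 8);
            out[2 * i + 1] = (byte) chars[i];
        }
        return out;
    }

    public static void main(String[] args) {
        CompactStringSketch latin = new CompactStringSketch("abc".toCharArray());
        CompactStringSketch utf16 = new CompactStringSketch("日本語".toCharArray());
        System.out.println(latin.coder + " " + latin.value.length);
        System.out.println(utf16.coder + " " + utf16.value.length);
    }
}
```

Note how a pure Latin-1 input ends up with one byte per character, while a single character above 0xFF forces the whole array into the two-bytes-per-char UTF-16 layout.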

Compressing Strings always happens by default, since the COMPACT_STRINGS field is implicitly true. However, it can be disabled by starting the JVM with the -XX:-CompactStrings flag.
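For instance, the feature can be toggled and verified from the command line (app.jar below is a placeholder for your own application):

```shell
# Disable Compact Strings (it is enabled by default since JDK 9)
java -XX:-CompactStrings -jar app.jar

# Check the effective flag value on a given JVM
java -XX:+PrintFlagsFinal -version | grep CompactStrings
```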

Microbenchmark

I have created a small test that concatenates multiple UTF-16 Strings and measured the elapsed time with the Compact Strings feature enabled (the default JDK 9 setting) and disabled (i.e. -XX:-CompactStrings). The code sample is below:
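The original benchmark source is not reproduced here; a minimal JMH benchmark in the same spirit might look like the following. The class name, field names, and the Cyrillic literals are my own illustrative choices (any characters above 0xFF would do), not the article's exact code:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class Utf16ConcatBenchmark {

    // Every char in these literals is above 0xFF, so the Latin-1
    // compression attempt fails on each String construction.
    private String first;
    private String second;

    @Setup
    public void setup() {
        first = "привет";
        second = "мир";
    }

    @Benchmark
    public String concatUtf16() {
        // Concatenation builds a new String, triggering the
        // compress-then-fall-back-to-UTF-16 path described above.
        return first + second;
    }
}
```

To compare the two configurations, the same benchmark would be run twice: once with the default JVM flags and once with -XX:-CompactStrings appended to the forked JVM's arguments.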

Conclusions:

with Compact Strings enabled, it takes more time (44.469 ns/op) to concatenate the same UTF-16 Strings than with Compact Strings disabled (35.785 ns/op). This overhead scales with the number of UTF-16 Strings in the application: the more UTF-16 Strings are created or concatenated, the more time is wasted, hence the less optimal it becomes!

this happens because every String construction attempts compression, which always fails here since the Strings contain only UTF-16 characters that cannot fit into one byte. Even if the COMPACT_STRINGS field were constant-folded away by the Just-In-Time compiler, the explicit call to the StringUTF16.compress() method still executes and costs time without any benefit in this case

in both cases the allocation rate is the same (168 B/op), hence almost the same throughput of producing Strings; only the construction time differs

This leads to an interesting takeaway: for applications that work extensively with UTF-16 characters, it might be worth considering disabling the Compact Strings feature for better performance! However, do not rely on this conclusion blindly; my advice is simply to keep it in mind and measure whether it actually benefits your application.