Strings consume a lot of memory in any application. The char[] holding the individual UTF-16 code units accounts for most of that memory, because each char takes up two bytes. It is not uncommon to find 30% of a JVM's memory consumed by Strings, because not only are Strings the best format for interacting with humans, but popular HTTP APIs also use lots of Strings.

With Java 8 Update 20 we now have access to a new feature called String Deduplication, which requires the G1 Garbage Collector and is turned off by default. String Deduplication takes advantage of the fact that the char arrays are internal to Strings and final, so the JVM can safely swap them behind the scenes.

Various strategies for String Deduplication have been considered, but the one implemented works as follows: Whenever the garbage collector visits String objects, it takes note of their char arrays. It computes a hash value for each array and stores it along with a weak reference to the array. As soon as it finds another String with the same hash value, it compares the two arrays char by char. If they match, one String is modified to point to the char array of the other String. The first char array is then no longer referenced and can be garbage collected.
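To make the steps above more concrete, here is a minimal sketch of the idea in plain Java. It is a toy model, not the HotSpot implementation: ToyString, ToyDedupTable and DedupSketch are hypothetical names made up for this illustration, and the real work happens inside the garbage collector on the private array of java.lang.String.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for java.lang.String: only the backing array matters here.
final class ToyString {
    char[] value; // in the real String this field is private and final

    ToyString(String s) {
        this.value = s.toCharArray();
    }
}

// Hypothetical model of the deduplication table (an assumption, not JVM code).
final class ToyDedupTable {
    // hash of the array contents -> weakly referenced candidate arrays
    private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();
    private int deduplicated = 0;

    // Called for every string the (imaginary) collector visits.
    void visit(ToyString s) {
        int hash = Arrays.hashCode(s.value);
        List<WeakReference<char[]>> bucket =
                table.computeIfAbsent(hash, h -> new ArrayList<>());
        for (WeakReference<char[]> ref : bucket) {
            char[] candidate = ref.get();
            if (candidate == null) {
                continue; // that array has already been collected
            }
            if (candidate == s.value) {
                return; // this string already uses the shared array
            }
            if (Arrays.equals(candidate, s.value)) { // char-by-char comparison
                s.value = candidate; // the old array is now unreferenced
                deduplicated++;
                return;
            }
        }
        // No match found: remember this array for later visits.
        bucket.add(new WeakReference<>(s.value));
    }

    int deduplicatedCount() {
        return deduplicated;
    }
}

final class DedupSketch {
    public static void main(String[] args) {
        ToyString a = new ToyString("Germany");
        ToyString b = new ToyString("Germany");
        ToyDedupTable table = new ToyDedupTable();
        table.visit(a);
        table.visit(b);
        System.out.println(a.value == b.value); // true: both share one array
    }
}
```

The weak references matter: the table must not keep an array alive once no String uses it any more.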

This whole process of course brings some overhead, but it is kept within tight limits. For example, if a string is not found to have duplicates for a while, it will no longer be checked.

So how does this work in practice? First you need Java 8 Update 20, which was released just recently.
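A small test program along the following lines can be used to try the feature. This is a sketch under assumptions: the class name DedupDemo, the batch sizes and the heap size in the comment are arbitrary choices for illustration, and whether deduplication actually runs depends on how often the garbage collector is triggered on your machine. The three -XX flags are the Java 8 options for G1, String Deduplication and its statistics output.

```java
import java.util.ArrayList;
import java.util.List;

// Run on Java 8 Update 20 or later, for example with:
//   java -Xmx128m -XX:+UseG1GC -XX:+UseStringDeduplication
//        -XX:+PrintStringDeduplicationStatistics DedupDemo
// (the heap size is an arbitrary choice for this sketch)
final class DedupDemo {

    // Adds `count` strings to `sink`, cycling through `distinct` different values.
    static void fill(List<String> sink, int count, int distinct) {
        for (int i = 0; i < count; i++) {
            // Concatenation at run time creates a fresh String with its own array,
            // so equal values are real duplicates in memory.
            sink.add("value-" + (i % distinct));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> retained = new ArrayList<>(); // keeps the strings alive
        for (int round = 0; round < 10; round++) {
            fill(retained, 50_000, 1_000);
            // Give the garbage collector and the deduplication work some idle time.
            Thread.sleep(100);
        }
        System.out.println("Retained strings: " + retained.size());
    }
}
```

With the statistics flag enabled, the JVM prints a report for each deduplication run to the GC output.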

For our convenience we do not need to add up all the data ourselves but can use the handy totals calculation. The statistics output discussed here is from the fourth execution of String Deduplication; it took 16ms and looked at about 120k Strings. All of them are new, meaning they had not been looked at before. These numbers look different in real apps, where strings are passed multiple times, so some might be skipped or already have a hash code (as you might know, the hash code of a String is computed lazily). In this case all strings could be deduplicated, removing 4.5MB of data from memory. The Table section gives statistics about the internal tracking table, and the Queue section lists how many requests for deduplication have been dropped due to load, which is one part of the overhead reduction mechanism.

So how does this compare to String Interning? I blogged about how great String Interning is for memory efficiency. In fact, String Deduplication is almost like interning, except that interning reuses the whole String instance, not just the char array.
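The difference can be illustrated with a few lines of Java. InternDemo and its helper method are hypothetical names for this sketch. Note that only the interning part is observable from Java code; sharing of the internal array by String Deduplication cannot be detected with ==.

```java
final class InternDemo {

    // Builds two separate String instances with the same content and checks
    // that interning maps both of them to one canonical instance.
    static boolean sameInstanceAfterIntern(String text) {
        String a = new String(text.toCharArray());
        String b = new String(text.toCharArray());
        boolean distinctBefore = (a != b);             // two objects, two arrays
        boolean sameAfter = (a.intern() == b.intern()); // one canonical String
        return distinctBefore && sameAfter;
    }

    public static void main(String[] args) {
        System.out.println(sameInstanceAfterIntern("Germany")); // true
        // With String Deduplication, a and b would stay two String objects;
        // only their internal arrays would eventually be shared by the JVM.
    }
}
```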

The argument the creators of JDK Enhancement Proposal 192 make is that developers often do not know where the right place to intern strings would be, or that this place is hidden behind frameworks. As I wrote, you need some knowledge of where you typically encounter duplicates (like country names). String Deduplication also benefits duplicate Strings across applications inside the same JVM and thus also covers things like XML Schemas, URLs, jar names etc., which one would normally not expect to appear multiple times.

It also adds almost no overhead to the application threads, as it is performed asynchronously and concurrently in connection with garbage collection, while String Interning happens in the application thread. This also explains why there is a Thread.sleep() in the sample code. Without the sleep there would be too much pressure on the GC, so String Deduplication would not run at all. But this is a problem only for such sample code. Real applications usually find a few ms of spare time to run String Deduplication.

This is actually important when you want to split a String correctly. Using charAt() might corrupt your data, as some characters (rarer CJK ideographs, for example) need two code units. What you want in that case is offsetByCodePoints(), but you give up O(1) string indexing in return.
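A short example of the difference, as a sketch: CodePointDemo and firstCodePoints are hypothetical names chosen for illustration, and an emoji is used as the supplementary character because it is easy to write as an escape sequence.

```java
final class CodePointDemo {

    // Returns the prefix of s that contains exactly `count` code points.
    // Throws IndexOutOfBoundsException if s has fewer than `count` code points.
    static String firstCodePoints(String s, int count) {
        int endIndex = s.offsetByCodePoints(0, count); // index in UTF-16 code units
        return s.substring(0, endIndex);
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 (one code point stored as two code units), then 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // Splitting by code units cuts the surrogate pair in half:
        String broken = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true

        // Splitting by code points keeps the pair together:
        System.out.println(firstCodePoints(s, 2).equals("a\uD83D\uDE00")); // true
    }
}
```

offsetByCodePoints() has to walk the string to find the index, which is where the O(1) indexing is lost.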

It is a tricky topic, and it is easy to confuse the terminology. I just wanted to point out that there is some value in being pedantic about it.