Iterating Java Characters

A Java char is not always a full Unicode character. I touched lightly on this before, but the point bears repeating. Unicode has grown beyond 16 bits, and Java's treatment of char has evolved with it: a char is a 16-bit UTF-16 code unit, and a character outside the Basic Multilingual Plane takes two of them. So when you say you need to iterate through the characters of a string, you could mean several different things. You could mean any of these, and there may be other options too:

maybe you mean UTF-16 code units

maybe you mean glyphs on the screen

maybe you mean Unicode code points

For this post, let’s assume the last of these: you need to iterate through Unicode code points.
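To make the three interpretations concrete, here is a small sketch that counts each one for the same string. The class and method names are made up for illustration. The glyph count relies on java.text.BreakIterator, which only approximates what a user sees as one character, and its results can differ between JDK versions.

```java
import java.text.BreakIterator;

class CharacterCounts {

    // Number of UTF-16 code units; this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Approximate number of user-perceived characters (grapheme boundaries
    // as reported by BreakIterator for the default locale).
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' + U+0301 (combining acute accent) + U+1F600 (an emoji, stored as a surrogate pair)
        String s = "e\u0301\uD83D\uDE00";
        System.out.println("code units:  " + codeUnits(s));   // 4
        System.out.println("code points: " + codePoints(s));  // 3
        System.out.println("graphemes:   " + graphemes(s));   // depends on the JDK's break rules
    }
}
```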

Mistaking a Char for a Code Point

I recently ran across some code that checked whether business names contain only valid characters. The code looked something like this:
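What follows is a minimal sketch of that kind of check, not the actual code. The class name and the validation rule (letters, digits, spaces, and a few punctuation marks) are assumptions made for illustration.

```java
class BusinessNameValidatorByChar {

    // Flawed on purpose: walks the string one char (UTF-16 code unit) at a time.
    static boolean isValidBusinessName(String name) {
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i); // may be only half of a surrogate pair
            if (!isAllowed(c)) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical rule: letters, digits, spaces, and a few punctuation marks.
    private static boolean isAllowed(char c) {
        return Character.isLetterOrDigit(c) || c == ' ' || "&.,'-".indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidBusinessName("Acme & Sons"));           // true
        // U+20000 is a CJK ideograph stored as the surrogate pair \uD840\uDC00.
        System.out.println(isValidBusinessName("\uD840\uDC00 Trading"));  // false
    }
}
```

Under this rule "Acme & Sons" passes, but a name containing the letter U+20000 is rejected, because Character.isLetterOrDigit(char) returns false for each surrogate half.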

The code authors were surprised when isValidBusinessName rejected some printable Unicode characters. The problem is that the method examines the string one UTF-16 code unit (one char) at a time, so it doesn’t always test full code points. A character outside the Basic Multilingual Plane (above U+FFFF) is stored as a surrogate pair, and for those characters the algorithm tests each surrogate half separately instead of the character the pair represents.

Correctly Iterating Code Points

A more correct way to validate the business name is to walk it by code point. The code looks similar, with a few small differences:
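Here is a minimal sketch of a code point loop, assuming a hypothetical rule that allows letters, digits, spaces, and a few punctuation marks. The class name and the rule are illustrative, not the original code. The loop reads a whole code point with String.codePointAt and advances by Character.charCount, which is 2 for a surrogate pair and 1 otherwise.

```java
class BusinessNameValidatorByCodePoint {

    // Walks the string one Unicode code point at a time.
    static boolean isValidBusinessName(String name) {
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);   // combines a valid surrogate pair into one value
            if (!isAllowed(cp)) {
                return false;
            }
            i += Character.charCount(cp);   // skip 1 or 2 code units
        }
        return true;
    }

    // Hypothetical rule: letters, digits, spaces, and a few punctuation marks.
    private static boolean isAllowed(int cp) {
        return Character.isLetterOrDigit(cp) || cp == ' ' || "&.,'-".indexOf(cp) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidBusinessName("Acme & Sons"));           // true
        // U+20000 is a CJK ideograph stored as the surrogate pair \uD840\uDC00.
        System.out.println(isValidBusinessName("\uD840\uDC00 Trading"));  // true
        // An unpaired surrogate is not a letter or digit, so it is still rejected.
        System.out.println(isValidBusinessName("\uD840"));                // false
    }
}
```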

This algorithm reads whole code points from the string, so it never tests one half of a surrogate pair on its own. A surrogate pair represents a single character, and you should not normally separate its two halves.

Checking Twice

Chances are that you have code like this in your product. With all the new emoji and so many other interesting, valid characters above the Basic Multilingual Plane, you may want to reconsider how you parse text. If your application should allow those additional characters, you’ll need to revisit the areas that parse strings or validate individual characters.

Good luck in your work. You’re not alone in this. Lots of great help already exists in the excellent Javadocs. Make sure you look at both the Character and String classes again if you haven’t read their documentation in a while. Since Java 5, both have gained methods that help you work with code points instead of code units.
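As a starting point, here is a small sketch of a few of those code point methods in use. The helper names are invented for this example. The calls themselves (String.codePoints, String.codePointCount, String.offsetByCodePoints, Character.isSupplementaryCodePoint) are standard, and codePoints requires Java 8 or later.

```java
class CodePointHelpers {

    // True if the text contains any code point outside the Basic Multilingual Plane.
    static boolean hasSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // First n code points of s, never splitting a surrogate pair.
    // Assumes n >= 0; returns the whole string if it has fewer than n code points.
    static String firstCodePoints(String s, int n) {
        int total = s.codePointCount(0, s.length());
        int end = s.offsetByCodePoints(0, Math.min(n, total));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00bc"; // U+1F600 followed by "bc"
        System.out.println(hasSupplementary(s));            // true
        System.out.println(firstCodePoints(s, 1).length()); // 2 (one code point, two code units)
    }
}
```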