Sunday, 7 September 2014

The Oracle database (RDBMS) has a "wonderful quirk" - it treats null and empty string as one and the same.

Oracle database empty string

If you have never used Oracle RDBMS you may not be aware of this particular "feature".
Basically, if you insert an empty string into the database it is treated as null.
Thus, a column declared as "NOT NULL" will reject the insertion of an empty string.
For me, and probably for most other developers, this isn't the expected behaviour.

The Stack Overflow explanation is "I believe the answer is that Oracle is very, very old."
As well as being amusing, it is as good an explanation as any other.

Last week, I had the dubious pleasure of porting the OpenGamma database infrastructure to work on Oracle.
It already worked on various other databases (including HSQLDB, Postgres and SQL Server) but Oracle is
rather an outlier in the world of SQL.

Most of the conversion went fine, with ElSql able to handle any deviations from standard SQL.
But the problem of empty strings and null was a little more vexing.

The first question was whether we could avoid needing to distinguish between null and an empty string.
I believe that if Oracle is your primary database, then you could code your application in that way.
However, if you are coding the application to work on multiple different databases then this is simply not an option.

The second question was, of course, how to handle the problem.
My research threw up three realistic possibilities:

1. Prefix all strings with the same character, such as '@'.
   The string 'Stephen' would be stored in the database as '@Stephen', while the empty string would be stored as '@'.

2. Replace the empty string with a magic value, such as '!!! EMPTY !!!'.

3. Store an additional flag column indicating the difference between null and the empty string.

Option 1 means that every string in the database is polluted, making it hard to use
the data directly by SQL (bypassing the application).
Option 2 relies on the magic value never occurring in real data, and direct SQL access is again compromised (though less than option 1).
Option 3 is the most "pure" solution, but would have been very hard to adopt when writing one piece of code
for multiple different databases.
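
To make the trade-offs of the first two options concrete, here is a minimal Java sketch. The class and method names are hypothetical, chosen only for illustration; the '@' prefix and the '!!! EMPTY !!!' magic value are the examples used above.

```java
// Hypothetical helper illustrating options 1 and 2; not code from OpenGamma.
final class AlternativeEncodings {
    static final String PREFIX = "@";
    static final String MAGIC_EMPTY = "!!! EMPTY !!!";

    private AlternativeEncodings() {}

    // Option 1: every non-null string gains the prefix, so "" becomes "@".
    static String prefixEncode(String value) {
        return value == null ? null : PREFIX + value;
    }

    // Option 1: strip the prefix again when reading.
    static String prefixDecode(String stored) {
        return stored == null ? null : stored.substring(PREFIX.length());
    }

    // Option 2: only the empty string is replaced; everything else is untouched.
    static String magicEncode(String value) {
        return value != null && value.isEmpty() ? MAGIC_EMPTY : value;
    }

    // Option 2: the magic value is read back as the empty string.
    static String magicDecode(String stored) {
        return MAGIC_EMPTY.equals(stored) ? "" : stored;
    }
}
```

With option 1, anyone querying by direct SQL has to remember to search for '@Stephen'. With option 2, a real value that happens to equal the magic value cannot be told apart from the empty string after decoding.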

My chosen solution was a variation on option 2 - encoding using "one extra space".
I'm documenting it here in case anyone else finds it to be a useful strategy:

The empty string is stored in the database as a single space (ASCII 32).

Any string that consists only of spaces (ASCII 32) is stored with one additional space added.

Any other string is stored without change.
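
The three rules above can be sketched in Java roughly as follows. This is an illustration of the encoding as described here, not the code from the actual commit; the class and method names are hypothetical.

```java
// Hypothetical sketch of the "one extra space" encoding described above.
final class OneExtraSpaceCodec {
    private OneExtraSpaceCodec() {}

    // Encode a value before it is written to the database.
    static String encode(String value) {
        if (value == null) {
            return null;  // null stays null
        }
        // The empty string counts as "only spaces" (zero of them), so it
        // becomes a single space; "  " becomes "   ", and so on.
        return isOnlySpaces(value) ? value + " " : value;
    }

    // Decode a value after it is read from the database.
    static String decode(String stored) {
        if (stored == null || stored.isEmpty()) {
            return stored;  // nothing to strip
        }
        // An all-space stored value always carries exactly one extra space.
        return isOnlySpaces(stored) ? stored.substring(0, stored.length() - 1) : stored;
    }

    // True if the string contains nothing but ASCII 32 (including "").
    private static boolean isOnlySpaces(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != ' ') {
                return false;
            }
        }
        return true;
    }
}
```

Treating the empty string as an all-space string of length zero means the first two rules collapse into a single branch.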

The advantage of this encoding is that the vast majority of strings are stored without being changed.
This makes access by direct SQL simple and obvious.
The only strings to be encoded are the empty string, and the unlikely "all space" strings.
The encoding will make little difference to most algorithms or displays even if it is not decoded.
The encoding is also fully reversible, provided that column length limits are not hit.

Fortunately, I was able to encode and decode the data in relatively few places.
For decoding, a decorated ResultSet worked effectively.
For encoding, it was possible to create a subclass of Spring's NamedParameterJdbcTemplate
and JdbcTemplate to do the trick.
See this commit for details.
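
For readers who cannot follow the commit, here is one way a decoding ResultSet decorator might look. It is a sketch that assumes a JDK dynamic proxy is acceptable; the real commit may well use a hand-written wrapper class instead. The names `DecodingResultSets`, `decorate` and `decode` are hypothetical.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.sql.ResultSet;

// Hypothetical sketch: wraps a ResultSet so that getString(...) results
// are decoded from the "one extra space" form.
final class DecodingResultSets {
    private DecodingResultSets() {}

    static ResultSet decorate(ResultSet delegate) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result;
            try {
                result = method.invoke(delegate, args);
            } catch (InvocationTargetException ex) {
                throw ex.getCause();  // rethrow the original SQLException
            }
            // Covers both getString(int) and getString(String).
            if ("getString".equals(method.getName()) && result instanceof String) {
                return decode((String) result);
            }
            return result;
        };
        return (ResultSet) Proxy.newProxyInstance(
                DecodingResultSets.class.getClassLoader(),
                new Class<?>[] {ResultSet.class},
                handler);
    }

    // Strip the one extra space from values that consist only of spaces.
    static String decode(String stored) {
        if (stored == null || stored.isEmpty()) {
            return stored;
        }
        for (int i = 0; i < stored.length(); i++) {
            if (stored.charAt(i) != ' ') {
                return stored;  // ordinary string, stored unchanged
            }
        }
        return stored.substring(0, stored.length() - 1);
    }
}
```

Only `getString` is intercepted in this sketch; other accessors such as `getObject` would return the raw stored value and would need the same treatment if the application uses them for string columns.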

As a final note, there are other complications with data storage in Oracle.
As I understand it, NaN and null also cannot be separated.
This post only covers the problem of strings, which is tricky enough!

Summary

I will continue to believe that an empty string and null are two different concepts.
Oracle RDBMS disagrees.
If you face the problem of creating a workaround, perhaps the option I used is worth considering.