Building the right thing, building it right, fast

Who I am

My name is Jakub Holy and I’m a software craftsmanship enthusiast and a (mainly JVM-based) developer since ~ 2005, consultant, and occasionally a project manager, working currently with Iterate AS in Norway. More under About and #opinion.

Posts Tagged ‘database’

I have finally managed to understand one of the most unusual databases of today, Datomic, and would like to share it with you. Thanks to Stuart Halloway and his workshop!

Why? Why?!?

As we shall see shortly, Datomic is very different from the traditional RDBMS databases as well as the various NoSQL databases. It even isn’t a database – it is a database on top of a database. I couldn’t wrap my head around that until now. The key to the understanding of Datomic and its unique design and advantages is actually simple.

The mainstream databases (and languages) have been designed around the following constraints of 1970s:

memory is expensive

storage is expensive

it is necessary to use dedicated, expensive machines

Datomic is essentially an exploration of what database we would have designed if we hadn’t these constraints. What design would we choose having gigabytes of RAM, networks with bandwidth and speed matching and exceeding harddisk access, the ability to spin and kill servers at a whim.

Often you need to insert a String from Java into a database column with a fixed length specified in bytes.
Using

string.substring(0, DB_FIELD_LENGTH);

isn’t enough because it only cuts down the number of characters but in UTF-8 a single character may be represented by 1-4 bytes. But you cannot just turn the string into an array of bytes and use its first DB_FIELD_LENGTH elements because you could end up with an invalid UTF-8 character at the end (one that is represented by 2+ bytes while only its 1st byte fits into the field). There are two solutions for truncation the string in such a way, that it has at most DB_FIELD_LENGTH bytes and is a valid UTF-8 string.

Approach 1: Replace the invalid trailing byte(s) with a ‘rectangle’

The new String constructor will automatically replace any invalid character (i.e. incomplete utf-8 char; we may only have one at the end) with the character \uFFFD, which looks like an empty rectangle. This character requires 3 bytes in utf-8 – therefore we decrease DB_FIELD_LENGTH by 2; the resulting string will have either exactly maxLen bytes if its last byte(s) is a valid utf-8 character or maxLen+2 bytes if it isn’t valid and this 1 byte was replaced by \uFFFD (3B).

Approach 2: Skip the invalid trailing byte(s) altogether

If you don’t want to have the rectangle character in the place of a split multibyte character, you must do yourself what the String constructor does internally, in a bit different way:

Approach 3: Manually remove the invalid trailing bytes

As you can see, the approach 2 requires quite lot of coding and method calls. If you know the details of UTF-8, namely how to distinguish an invalid byte or byte sequence then you can simply truncate the byte array and then remove/replace the invalid bytes. I’d be glad for the code :-)