IT Sky Consulting GmbH

Indexing of long UTF-8 files or strings

The UTF-8 format has the disadvantage that the number of bytes per code point varies. Accessing a given position counted in characters is only possible by scanning from the beginning or by providing an indexing structure. It is a good idea to find a balance between size and speed: indexing blocks of several kilobytes and scanning through a single block to reach the exact desired position should achieve both goals. The ideas described here can also be applied to compressing a long string or file by compressing these multi-kilobyte blocks separately. In a way, UTF-8 is itself a compression scheme that allows many code points, which would otherwise take four bytes each, to be stored in one, two or three bytes. So let us just assume we have characters and bytes; it could equally be arbitrary-length numbers and bytes. And let us talk about compressed and uncompressed blocks.

Grouping and indexing can be done by giving each compressed block approximately a fixed number of bytes, probably by putting into it the largest number of characters that fit. Blocks can then start at multiples of their maximum size, and for each block we have meta information about its position in the chain of blocks, its uncompressed size and the position of its first entry in the totally uncompressed file or string. If the indexing structure is kept in memory only, this is quite easy to achieve. If it is stored in the file, several approaches are possible: add an additional indexing file to each "compressed" text file; add the meta information at the beginning of the file, thus limiting the number of blocks at creation time; or add the meta information at the end of the file. On some file systems it could also go into a second stream, which seems to be the cleanest approach, but it is only applicable to file systems that are not mainstream in the Unix and Linux world.
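A minimal sketch of this first layout, kept in memory only. The names `pack` and `char_at_padded` and the zero-byte padding are assumptions made for illustration; "compression" here is simply the UTF-8 encoding itself, and decoding one block stands in for the scan.

```python
from bisect import bisect_right

def pack(text: str, block_bytes: int = 4096):
    """Pack text into fixed-size byte blocks.

    Returns (store, first_chars, used): block k starts at byte
    k * block_bytes, holds used[k] bytes of UTF-8 and begins with
    character number first_chars[k].
    """
    if block_bytes < 4:
        raise ValueError("a block must hold at least one 4-byte character")
    store = bytearray()
    first_chars, used = [], []
    current = bytearray()
    start_char = 0

    def flush(next_start: int) -> None:
        nonlocal current, start_char
        first_chars.append(start_char)
        used.append(len(current))
        store.extend(current.ljust(block_bytes, b"\0"))  # pad to full size
        current = bytearray()
        start_char = next_start

    for i, ch in enumerate(text):
        encoded = ch.encode("utf-8")
        if len(current) + len(encoded) > block_bytes:
            flush(i)  # character i no longer fits, it opens the next block
        current += encoded
    if current:
        flush(len(text))
    return bytes(store), first_chars, used

def char_at_padded(store, first_chars, used, block_bytes, index):
    """Return character number `index` by decoding only one block."""
    if index < 0:
        raise IndexError(index)
    k = bisect_right(first_chars, index) - 1  # last block starting at or before index
    start = k * block_bytes
    block = store[start:start + used[k]].decode("utf-8")
    return block[index - first_chars[k]]
```

Because every block starts at a multiple of `block_bytes`, only the character positions need to be stored; the price is the padding at the end of each block.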

The other way around, the uncompressed data can be grouped into blocks of equal uncompressed size, which are then stored densely, one directly after the other. In this case the meta information needs to contain the address of the first byte of each block. Both approaches seem to be reasonable.
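This second layout could be sketched as follows, again only in memory and with UTF-8 as the "compression". The function names and the default of 4096 characters per block are assumptions; the index is just the byte address of every N-th character.

```python
def build_index(data: bytes, chars_per_block: int = 4096) -> list[int]:
    """Return the byte offsets of characters 0, N, 2N, ... in UTF-8 data."""
    offsets = []
    char_count = 0
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            if char_count % chars_per_block == 0:
                offsets.append(pos)
            char_count += 1
    return offsets

def char_at_dense(data: bytes, offsets: list[int], index: int,
                  chars_per_block: int = 4096) -> str:
    """Jump to the block containing `index`, then scan forward inside it."""
    block = index // chars_per_block
    if index < 0 or block >= len(offsets):
        raise IndexError(index)
    pos = offsets[block]
    for _ in range(index % chars_per_block):
        pos += 1
        while pos < len(data) and data[pos] & 0xC0 == 0x80:
            pos += 1  # skip continuation bytes of the current character
        if pos >= len(data):
            raise IndexError(index)
    end = pos + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[pos:end].decode("utf-8")
```

No space is wasted on padding, and the block number follows directly from the character index, but the byte address of every block has to be stored.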

It should be observed that random-access reading is possible with this approach, but random-access writing is much harder. It cannot easily be supported unless the blocks can be stored anywhere and the meta information contains everything needed to find them: current block size in bytes, current block size in characters, current location of the first byte, position of the block in the chain, and number of the first character. When inserting characters, all meta information needs to be updated and the block in which the change happens needs to be rewritten completely, but there could be ways to achieve this.
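To illustrate what an insert involves, here is a small in-memory sketch. The `Block` class and both function names are hypothetical; the position in the chain is simply the list index, the size in bytes is `len(block.data)`, and a real file format would also have to find room for a block that has grown.

```python
from dataclasses import dataclass

@dataclass
class Block:
    data: bytes      # UTF-8 bytes of this block, could be stored anywhere
    char_count: int  # current block size in characters
    first_char: int  # number of the first character in the whole text

def make_blocks(text: str, chars_per_block: int) -> list[Block]:
    blocks = []
    for start in range(0, len(text), chars_per_block):
        piece = text[start:start + chars_per_block]
        blocks.append(Block(piece.encode("utf-8"), len(piece), start))
    return blocks

def insert_text(blocks: list[Block], index: int, new_text: str) -> None:
    """Insert new_text before character number `index`, in place."""
    if index < 0:
        raise IndexError(index)
    for i, block in enumerate(blocks):
        if index <= block.first_char + block.char_count:
            break
    else:
        raise IndexError(index)
    # Rewrite the affected block completely.
    local = index - block.first_char
    text = block.data.decode("utf-8")
    text = text[:local] + new_text + text[local:]
    block.data = text.encode("utf-8")
    block.char_count = len(text)
    # Update the meta information of every later block.
    for later in blocks[i + 1:]:
        later.first_char += len(new_text)
```

Only one block is re-encoded, but the update of `first_char` touches every block after it, which is the part a smarter structure would have to avoid.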

The question arises whether this has been standardized in some way. I have not heard of such a standard, but I observe that dealing with UTF-8 is always a mess or that this issue is simply neglected. Databases have of course answered these kinds of questions quite well in their own way, but I think there could be a useful standard for compressed random-access text files, maybe even supported by the libc or by the POSIX standard.

If a typical compression is applied, it is possible to keep a common dictionary for frequently occurring blocks and their encoding in one place in the file and to store just the encoded data in the actual blocks, avoiding duplication. That would require a third area for storing data.
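One existing mechanism that comes close to this idea is the preset dictionary of zlib, available in Python through the `zdict` argument. The following is a sketch under that assumption, not the scheme described above in every detail: each block is compressed independently, so random access is preserved, and the shared dictionary is stored once. The helper names are made up for illustration.

```python
import zlib

def compress_blocks(blocks: list[bytes], shared_dict: bytes) -> list[bytes]:
    """Compress every block on its own, all using the same preset dictionary."""
    out = []
    for block in blocks:
        compressor = zlib.compressobj(zdict=shared_dict)
        out.append(compressor.compress(block) + compressor.flush())
    return out

def decompress_block(compressed: bytes, shared_dict: bytes) -> bytes:
    """Decompress a single block; the same dictionary must be supplied."""
    decompressor = zlib.decompressobj(zdict=shared_dict)
    return decompressor.decompress(compressed) + decompressor.flush()
```

The dictionary would typically consist of byte sequences that occur in many blocks; how to choose it well is a separate question.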