(Note: this is written 2 years after the bulk of the work was done on this codebase, so my thoughts will be basic, muddled, confused, and probably fabricated from the lies one tells oneself. You have been warned.)

base*

One way to get around this is by using base64, which takes all the numbers and upper/lowercase alphabet characters to encode the data (plus two more, but we’ll ignore them for now). Now you don’t have any problems with your transport layer trying to intepret your bytes, because everyone handles the alphanumeric characters just fine, and if they don’t you probably don’t want to be using them.

Of course, after using base64 for a while, you might notice that while it’s fantastic that you aren’t having your data mutate at the whims of a capricious god that thinks he’s helping, you’re giving up efficiency by using base64. For every 3 bytes in your binary blob, you have 4 bytes if alphanumeric characters that you need to transmit, so you’re paying for data protection with a 33% size levy: if the mob could skim off 33% in protection rackets, you bet they would.

Looking at base64, you might notice that there are way more ASCII characters that are displayable than just the 64 used by base64, which could be used in encoding in order to make the transport density higher. Thus was base85 born, using… you guessed it, 85 characters! This time around, 4 bytes of binary data get encoded into 5 ASCII characters, which results in a 20% transport size overhead. But what can you do? It’s not like there are more displayable ASCII… oh, wait.

So it appears that no one had tried to take the obvious next step beyond base85, so I decided to try my hand at making base92.

Design

We take a similar approach to the base64 design, and encode a number of bits from the binary stream into an ASCII character. However, instead of using 6 bits for each character (base64), we use… 6.5? That’s kind of weird! What do you do with the 0.5 bit? Here we take a page from base85‘s book, and instead of taking the raw bit stream as a basis for the resulting characters, you convert a chunk of bits to an actual number (13 in our case, 32 in base85‘ case), and then read off the values in base whatever.

A nice property of using only 13 bits can be that base92 is more resistant to transport corruption than base85, since changing one character in the base92 transport will only effect 2-3 bytes after decoding. I haven’t actually run the numbers to find out if one can cause widespread corruption in 4 consecutive bytes with a 1-bit flip of the transport material, but it seems plausible.

A decision remains to be made: which characters should we use? There are in fact 94 traditionally displayable ASCII characters, so if we use 6.5 bits, we have 91 characters to work with. Which three do we
leave out? Making a choice at random is hardly optimal: it makes no difference to a computer, but if the data has to pass through a visual or voice channel, we can reduce some ambiguity. For instance, using only one of 1 or l can make it easier to handle fonts that display those characters as nearly the same. Instead, I chose to disambiguate between the quotes, ` ” and `, since they’re usually very similar, and one of them is merely a doubling of another, a recipe for readability disaster.

Finally, as noted in the README, base91 was just too ugly a name, so I added an optional extra character to get base92.

I believe the library was written in a TDD style, with the handy doctests tool. It’s nifty, keeping your tests with your code and making it available to all only a help() call away, but it does bloat up your docstrings and makes help() less… helpful than a nice, concise docstring that isn’t trying to test all the facets of functionality.

I will confess, using python’s binary string formatting makes me sad inside. If you can believe it, doing massive string munging makes me sadder than it not being in C. One of these days, I’ll hook up the C
library with a python wrapper. One of these days.

On the bright side, I believe this was my second project that I uploaded to pypi!

Currently the repo only contains a companion C library. I know, I can just make wrappers to everything else with a fine C library. Uh, maybe you can contribute! I’ll see you in a bit!

Prior art

It turns out that I fail at scholarship, and that there has been an effort to create a basE91 over at Sourceforge, and that the project started in 2006, which is way before base92 was even a glimmer in my young impressionable eyes. Oh well.

Usage

Eh, you actually want to use this? Well, if you’re interested in using base92, you can grab it from github. The README should have details on how to use it, so I won’t repeat myself here.