Converting numbers to strings the hard way

Turning numbers into strings can be a big waste of time and money. Paul Khuong wrote about this in How to Print Integers Really Fast (With Open Source AppNexus Code!). If you’ve bought into the JSON mania you’re probably serializing numbers as strings quite a bit without even thinking about it. Much of his discussion is special to C or C++ where you (think) you directly control the hardware. Since this is also a typical interview problem I thought I’d work on it in Perl 6 with a few extras.

The solutions I found convert everything to decimal numbers. There are probably some hexadecimal converters out there. That’s not so interesting. How about a converter for arbitrary bases with arbitrary digits? And, I can’t use any of the builtin stuff to make this work (and so far I haven’t peeked to see how Rakudo does it).

The solution is to decompose a number into parts and convert that partto a string with a lookup table. Put together all your string parts to get the full number. Optimizing that lookup table is most of the problem but I don’t worry about that right away.

I create a factory that takes a single string where each character represents a digit; the number of digits is the base. I don’t care which characters are in the string. Maybe they aren’t in order of their codepoints, or they are repeated, or some other weird thing. Specify a stupid string and get stupid number serializations. If I don’t specify a string I’ll default to the decimal digits. From that I get the base that I’ll use to decompose the number and I’ll break up the string into characters in an array so I can use the numeric value as the index to map it onto its string value. I’ll collect the string values then join them:

This doesn’t convert numbers to strings. It creates code that converts numbers to strings and returns that anonymous subroutine. That’s what a factory does. I’ve also limited this to zero and positive integers but handling sign is not a big deal if I need it.

The div operator for integer division is quite a handy complement to the % modulus to get the remainder. I thought briefly about defining my own operator to return the remainder and reduce the starting number in one step, but maybe I’ll come back to that in some other post.

If I were doing this in C++ I’d care about that because the bits in physical storage probably care about them. The original posts were using binary-coded decimal that used four bits per digit. Add the same nybble to each of those and you get the ASCII character for that digit. Clever. Since Perl 6 doesn’t let me look at the physical representation I can skip that part of the problem (although I could connect to external libraries). That also means that I don’t have to allocate any memory for my final string. I don’t get to make those choices although the C++ program thinks hard about that.

Remember that my use of a literal hexadecimal number there is a red herring. The parser converts that into an internal integer and doesn’t care how it started. I’m converting internal numbers to strings.

So let’s make some fancy strings. I’ll make a hash with labels for the representation as keys and their code as the value (it’s all in a gist), but that might diverge from the code in this article):

As I mentioned earlier I don’t get to think as hard in Perl 6 as C++. I’m much farther away from the iron so choosing registers wisely and other things that hardware cares about. Andrei Alexandrescu writes about this in Three Optimization Tips for C++ where he solves the same problem.

Part of these C++ solutions involve decomposing the number into larger chunks (powers of the base) and translating chunks with a longer lookup table. In C++ they wanted to minimize division operations, which probably makes less sense in a much higher level language like Perl 6. Instead of decomposing 8936 into 8, 9, 3, 6 I can decompose it into 89 and 36 and translate those numbers directly without more math. There’s a trick; if I break up the number 1307 into 13 and 07, I can’t translate 07 to 7. That zero needs to stay there.

How big that lookup table is depends on many things including the typical size of your numbers. A smaller table requires you to do more divisions but a larger table might involve more searching. I used an array but I could have used a hash. No optimization is probably best for any situation. You have to know something about your numbers. If you always have numbers that take up all 256 bits (a SHA perhaps) you have a different optimization than numbers mostly under 100.

But now here’s another thing I can do. I can break up a really big number into much larger chunks and provide those smaller numbers as a sequence. I then format those separately (perhaps in parallel) by calling .hyper on that sequence. The .hyper allows parallelism (where available) and keeps everything in order (unlike .race which doesn’t). I reässemble the chunks at the end.

The big-chunks-sequence sequences makes the smaller numbers and I’ve moved the factories into a module to get them out of the way (gist).