Squozen Sort

version 0.2

The question is sometimes raised whether it's possible to sort a set of,
say, one million thirty-two-bit numbers using only, say, two million bytes of
memory. At first blush they wouldn't seem to fit -- stored plainly, a million
32-bit numbers need four million bytes -- but it can all be done in memory;
this page describes one way, implemented in the program sort_n.c.

To clarify some assumptions: we'll read the numbers in ascii decimal from
standard input, one line per number. They'll be unsigned longs. There may
be duplicates. We'll have some fixed overhead, but the main arrays we sort in will
total two million bytes. The results will be printed on standard output.
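
For concreteness, reading the input might look something like the sketch
below. This is only an illustration of the assumed format; sort_n.c's own
reader may differ in details like error handling, and read_numbers is a name
made up for the sketch.

    /* Read unsigned 32-bit values, one per line of ascii decimal, from
       standard input into buf (capacity cap); returns how many were read. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    static size_t read_numbers(uint32_t *buf, size_t cap)
    {
        char line[64];
        size_t n = 0;
        while (n < cap && fgets(line, sizeof line, stdin))
            buf[n++] = (uint32_t)strtoul(line, NULL, 10);
        return n;
    }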

Compress as you sort.

What makes this possible is that a sorted list of
numbers contains less information than the same list before sorting.
The number of combinations of
a million items, duplicates allowed, chosen from a set of 2^32 is about
2^13,511,283.1 (calculated here).
It takes about 13.5M / 8 ~= 1.7M bytes to represent one such
choice.
If we can compress our intermediate sorting results nearly that well,
we'll have enough memory to sort in.
Here we use a one-level merge sort
to do the sort as a whole.
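
That 13.5-million-bit figure is easy to check numerically. The sketch below
(not part of sort_n.c) uses lgamma to take the log of the number of
multisets, C(2^32 + 10^6 - 1, 10^6), without overflowing; linked with -lm,
it should print the bound quoted above.

    /* Information-theoretic bound: the number of sorted lists of n values
       drawn (duplicates allowed) from a universe of u values is
       C(u + n - 1, n). */
    #include <math.h>
    #include <stdio.h>

    static double log2_choose(double n, double k)
    {
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2.0);
    }

    int main(void)
    {
        double u = 4294967296.0;   /* 2^32 possible values */
        double n = 1000000.0;      /* one million numbers  */
        double bits = log2_choose(u + n - 1.0, n);
        printf("%.1f bits, %.0f bytes\n", bits, bits / 8.0);
        return 0;
    }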

Rice-coded differences

Sorting some of the input is the first compression step. The next is to
encode the differences between successive numbers rather
than the numbers themselves.
I call the differences "gaps".
Finally, we need an encoding that takes
advantage of the fact that most of the gaps are small.
The code used here is a
Golomb/Rice code:

[1...] 0 bbbb bbbb bbbb

That is, zero or more one bits, a zero, and twelve binary bits.
The leading ones represent
gap / 4096 in unary, and the
binary bits encode gap % 4096.
For a million gaps that add up
to 2^32, the code uses at most

2^32/4096 + 1,000,000*(1+12) = 14,048,576 bits = 1,756,072 bytes,

which is within 4% of the theoretical minimum mentioned above. It's easy to verify that the Rice code works well
here, but the choice wasn't obvious or intuitive to me
beforehand, so
this page shows a systematic path to it.
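
To make the encoding concrete, here is a minimal encoder and decoder for
this code. The bit order, the bits struct, and the function names are
choices made for the sketch; sort_n.c may handle buffering differently.

    /* Rice code with divisor 4096: gap/4096 one bits, a zero bit, then
       gap%4096 in twelve binary bits.  Bits are packed msb-first into a
       caller-supplied byte buffer that starts out zeroed. */
    #include <stdint.h>
    #include <stddef.h>

    enum { RICE_BITS = 12 };

    struct bits {
        uint8_t *buf;   /* compressed stream            */
        size_t   pos;   /* next bit (0 = msb of buf[0]) */
    };

    static void put_bit(struct bits *b, int bit)
    {
        if (bit)
            b->buf[b->pos >> 3] |= (uint8_t)(0x80u >> (b->pos & 7));
        b->pos++;
    }

    static int get_bit(struct bits *b)
    {
        int bit = (b->buf[b->pos >> 3] >> (7 - (b->pos & 7))) & 1;
        b->pos++;
        return bit;
    }

    static void rice_put(struct bits *b, uint32_t gap)
    {
        uint32_t q = gap >> RICE_BITS;           /* unary part  */
        for (uint32_t i = 0; i < q; i++)
            put_bit(b, 1);
        put_bit(b, 0);                           /* stop bit    */
        for (int i = RICE_BITS - 1; i >= 0; i--) /* binary part */
            put_bit(b, (gap >> i) & 1);
    }

    static uint32_t rice_get(struct bits *b)
    {
        uint32_t q = 0, r = 0;
        while (get_bit(b))                       /* count the leading ones */
            q++;
        for (int i = 0; i < RICE_BITS; i++)      /* twelve binary bits     */
            r = (r << 1) | (uint32_t)get_bit(b);
        return (q << RICE_BITS) | r;
    }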

The merge sort

Given this method of compressing sorted lists, a merge sort seems natural.
The usual merge sort is a recursive binary merge, but the trouble with a
recursive sort in this context is that our code is (and has to be) less
efficient for smaller lists of numbers.
I use a single-level merge here; here's why that's not
terribly bad.

If we allocate a buffer for the final sorted list, the space we have
left is...

2,000,000 - 1,756,072 = 243,928 bytes

...enough for a buffer of 60,982 32-bit numbers.

We will fill and sort the small buffer 1,000,000 / 60,982 < 17 times (but
see below), and each time merge it into the big buffer. That means
that in addition to the work of sorting the small bufferfuls,
we do about 1,000,000 * 17 / 2 comparisons, 8.5 * n, which is about
half the number of comparisons within the sorts. But we also do about
that number of compressions and decompressions, which are more
expensive.

The growth of the buffer over merge passes looks like a triangle, which
suggests O( n^2 ) work, but the base of the triangle is set by
the ratio of the two buffer sizes, not by the size of the input.
As long as we're working in memory, and the two buffers' capacities
are k:1, then the total work is

O( n * (k/2 + log( n/k )) ),

which is still O( n log n ).
To merge the sorted numbers into a single buffer that can only fit the
final result, first slide the compressed data to the end of the big buffer.
Start uncompressing the old data, merging with the small buffer of
sorted but uncompressed data, compressing the result, and filling in
the big buffer from the beginning. Old compressed data is
read and consumed just ahead of the write position, leaving
space behind it for the data that has just been merged and recompressed.

The slack space between the new data and the remaining old data
shrinks because more data from the small buffer is being inserted. But
because the big buffer is big enough for the final result, we know that
the total of the new and remaining old data will fit without overlap--
the same calculation of the total size is valid at every intermediate
point.
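
Stripped of the compression, the trick looks like the sketch below: slide the
old data to the end, then let the write index chase the read index up from
the front. The argument above is exactly why the two never collide. The
plain-array form and the name merge_into_big are mine; in sort_n.c the reads
and writes on the big buffer go through the Rice coder instead.

    /* Merge small[0..small_n) into big[], which already holds old_n sorted
       values and has room for old_n + small_n values in total. */
    #include <stdint.h>
    #include <string.h>

    static void merge_into_big(uint32_t *big, size_t old_n,
                               const uint32_t *small, size_t small_n)
    {
        size_t cap = old_n + small_n;
        size_t r, s = 0, w = 0;

        /* Slide the old sorted data to the end of the big buffer. */
        memmove(big + small_n, big, old_n * sizeof big[0]);
        r = small_n;                  /* old data now starts here */

        /* The write index never overtakes the read index: what remains of
           the old and small data always fits between w and the end. */
        while (s < small_n)
            big[w++] = (r < cap && big[r] <= small[s]) ? big[r++] : small[s++];
        while (r < cap)
            big[w++] = big[r++];
    }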

One more optimization: the last merge step doesn't need its results
recompressed into the big buffer; they only need to be sent to the output.
That means the big buffer's capacity can be reduced by the capacity of the
small buffer. That in turn means the small buffer can be larger, which
means the big buffer can be smaller, and so forth. I didn't
look for a closed form or an efficient way to find the squeezed buffer size;
I just iterate until I get to a fixed point. For the original problem of
one million longs in two million bytes, this reduces ~17 passes of ~61K
items each to ~10 passes of ~103K.
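
The iteration itself is short. The sketch below uses the worst-case size
worked out earlier (13 bits per number plus 2^32/4096 bits of unary) as the
big buffer's requirement; that sizing rule is my reading of the scheme, not
necessarily the exact one in sort_n.c, but with these inputs it settles on
the ~103K-entry small buffer and ten passes mentioned above.

    /* Fixed-point iteration for the squeezed buffer sizes: the big buffer
       only has to hold the compressed form of the numbers that are NOT in
       the small buffer on the last pass; whatever is left over becomes the
       small buffer. */
    #include <stdio.h>
    #include <stdint.h>

    static uint64_t compressed_bytes(uint64_t m)
    {
        uint64_t bits = m * 13 + (UINT64_C(1) << 32) / 4096;
        return (bits + 7) / 8;
    }

    int main(void)
    {
        uint64_t total = 2000000;   /* bytes for both buffers */
        uint64_t n     = 1000000;   /* numbers to sort        */
        uint64_t small = 0, prev;

        do {
            prev  = small;
            small = (total - compressed_bytes(n - small)) / 4; /* 4 bytes each */
        } while (small != prev);

        printf("small buffer: %llu numbers, %llu passes\n",
               (unsigned long long)small,
               (unsigned long long)((n + small - 1) / small));
        return 0;
    }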

Tests

The sort will work with less memory (down to a limit) at the expense of
more passes, or with
more memory and fewer passes. The bash script
time_test runs the
same sort problem with buffer sizes from 1,800,000 to 4,100,000 bytes.
With the smallest buffer, 55 passes and 15.2 seconds are needed on my computer.
With a 2MB buffer, 10 passes and 5.6 seconds, and with the largest buffer,
the compression and merging are skipped entirely and the sort takes
4.6 seconds. In each of these, about 3 seconds are spent reading and writing
the file.

That compares to 18.5 seconds for Unix's sort -n, but that's a little unfair
since Unix sort works on the data as strings, sorts by arbitrary combinations
of fields, and can sort a bigger variety of number formats.