Since I've been considering getting into software development, I've recently been working on different interview questions. I found this one posted on reddit and I liked it so much I decided to implement it in code. Here's the train of thought that led to my solution. I've been trying to refine my software development skills and I'd love to get everyone's opinion on how I did and whether there's anything I could refine further.

Problem: How do you sort and store 1 million 8 digit numbers into 1 MB of RAM?

So some sort of compression is going to be needed. If you work out how much space each number can take in roughly 1 MB of RAM, you get the following.

1 million * 8bits = 976.5625 kilobytes

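So we have roughly 1 byte (2^8 = 256 possible values) to store each number, even though an 8 digit number needs about 27 bits. One option is binning: let the bin carry the upper bits and store only the low 8 bits in each slot. The problem with binning is that you would have to have a whole lot of bins.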

27bits-8bits = 19bits worth of bins 2^19 = 524288.

Worse than that is the average number of numbers per bin

1 million / 2^19 = 1.907 numbers per bin.

Due to the random variability of the input it is almost guaranteed that the bins will not all hold the same number of entries. Even if the bins had only 3 spots each, that is 3 * 2^19 * 8 bits = 1.5 MB. The numbers-per-bin figure is also a mathematical statement of how dense the numbers are. There are about 100 million possible values (0 to 99,999,999), so the average spacing between the numbers is about 99. Maybe a better way of storing the numbers is not by bins and the absolute value of the numbers but by their relative spacing from one another.

Differential Encoding

Take for example an "array" with a couple numbers in it. Let's say we shall use the numbers...

5,10,30,45

If we were to store them as just integers we would need 32 bits per int. Instead of representing their total value, let's use the spacing between them.

0 |==5==| 5 |==5==| 10 |==20==| 30 |==15==| 45

We can now repack our numbers as the differences between our original numbers:

5,5,20,15

While at first glance it may seem as though we haven't really made our numbers any easier to store, the difference becomes more apparent with larger numbers.

This is a great way to pack densely spaced numbers together. There are several other neat little advantages to this type of encoding. First and foremost is that there is almost no memory overhead involved with storing the data structure. It's only a flat array with no trees or anything like that.
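To make the repacking concrete, here is a minimal Python sketch of the idea (the thread's actual code was C# and is not shown, so the function names `delta_encode` and `delta_decode` are just illustrative). It assumes the input is already sorted and that the first gap is measured from 0.

```python
from itertools import accumulate

def delta_encode(sorted_values):
    """Return the gap between each value and the one before it (starting at 0)."""
    gaps = []
    previous = 0
    for value in sorted_values:
        gaps.append(value - previous)
        previous = value
    return gaps

def delta_decode(gaps):
    """Rebuild the original values by taking a running sum of the gaps."""
    return list(accumulate(gaps))

print(delta_encode([5, 10, 30, 45]))  # [5, 5, 20, 15]
print(delta_decode([5, 5, 20, 15]))   # [5, 10, 30, 45]
```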

I had to make a couple tweaks to the model to make it work here. First off, we can't use variable length integers since that would require extra space to store the size of each integer. We only have 8 bits to store each 27 bit number, and the problem with 8 bit numbers is that they cap the distance between adjacent numbers at 255. Obviously there are going to be cases where the distance between adjacent numbers is significantly more than 255 (since the max size of our numbers is 99,999,999). We get around this by setting a rule: if the current byte is 255, it doesn't contain a number and just acts as a spacer.

This is our first major limitation of this type of encoding. If the gap between adjacent numbers in the array is greater than 255, we have to waste bytes as "spacers". Fortunately we know from one of the above calculations that there should be roughly 1 number per 99 worth of range.
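A rough Python sketch of the spacer rule might look like this. It is an assumption on my part that a 255 byte advances the running total by exactly 255 and that the values 0 to 254 are literal gaps; the post does not spell that out, and the names `encode_gaps` and `decode_gaps` are made up for the example.

```python
SPACER = 255  # assumed: a 255 byte holds no number and just advances the total by 255

def encode_gaps(sorted_values):
    """Encode sorted values as one-byte gaps, padding large gaps with spacer bytes."""
    out = bytearray()
    previous = 0
    for value in sorted_values:
        gap = value - previous
        while gap >= SPACER:
            out.append(SPACER)
            gap -= SPACER
        out.append(gap)  # 0..254, so it can never be mistaken for a spacer
        previous = value
    return bytes(out)

def decode_gaps(data):
    """Walk the bytes, emitting a value at every non-spacer byte."""
    values = []
    current = 0
    for byte in data:
        current += byte
        if byte != SPACER:
            values.append(current)
    return values

packed = encode_gaps([5, 10, 30, 45, 1000])
print(list(packed))         # [5, 5, 20, 15, 255, 255, 255, 190]
print(decode_gaps(packed))  # [5, 10, 30, 45, 1000]
```

The jump from 45 to 1000 costs three spacer bytes plus one value byte, which is the waste described above.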

Another limitation of this type of encoding could be the limit on how large a number we can theoretically represent. A quick calculation shows that the max number that can be represented by a million bytes of our encoding is....

1 million * 255 ~ 255,000,000 max range that can be represented by an array like this.

But that is with an unoccupied array. What about an array where there is an average of 99 between the numbers?

1 million * 99 ~ 99,000,000

That's cutting it close but it should work.

Recursion vs. Loops: So I've run into performance and scaling problems. While my simple recursive version of the insert function worked for a small number of bytes, it didn't scale much beyond that. The first problem is that using recursive functions causes stack overflows.

A recursive function is a function that calls itself.

The problem with this type of function is that the runtime has to keep track of all active function calls on the call stack, so if our function calls itself a million times the stack has to hold 1 million active calls, which it can't, so the program crashes. So the first step was to replace the recursive calls with for loops. This replaces each recursive call with an iteration of a loop, so no extra calls are made.
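As a small illustration of the difference (in Python rather than the C# the project used, with made-up function names), here is the same walk over an array written both ways. In Python the deep recursion ends in a RecursionError instead of a hard stack overflow, but the cause is the same: one pending call per element.

```python
def count_recursive(data, index=0):
    """Walk the array with one nested call per element; depth grows with len(data)."""
    if index == len(data):
        return 0
    return 1 + count_recursive(data, index + 1)

def count_loop(data):
    """Same walk as a loop; uses a constant amount of call stack."""
    count = 0
    for _ in data:
        count += 1
    return count

small = bytearray(10)
print(count_recursive(small), count_loop(small))  # 10 10

big = bytearray(1_000_000)
print(count_loop(big))  # 1000000
# count_recursive(big) would raise RecursionError under Python's default limits.
```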

Bins: The next problem I'm facing is that insertions take ~O(n), which is fine with a couple thousand bytes in the array, but with a million or so bytes insertions will take forever. Currently, with a million byte array and 10,000 numbers to add, it takes ~42 seconds. So to sidestep this dilemma I'm going to use bins to break the array into chunks (so much for saying bins wouldn't work). I'm just going to say that I want bins with roughly 50,000 numbers in each.

Multithreading: So performance is still a problem. Inserting a million numbers into a ~million byte array takes ~6 minutes, so I've decided to multithread the code. Clearly any microcontroller with only 1 meg of RAM is going to have a few cores lying around to speed things up. The reason for leaving this optimization for near last is that it is tricky, and oftentimes extremely difficult, to get right. The trickiest part of getting multithreading working is keeping all the workers from stepping on each other's toes. It is really easy to get into a situation where one thread is modifying a value that another is working on or reading. Fortunately our decision to use bins earlier makes this a whole lot easier. Each bin is fully independent of the other bins, so each thread can be assigned to a single bin and there's no worry that one thread is going to get in the way of another.

I've implemented this in code by having one thread distribute numbers to the bins via a series of queues. Once numbers start arriving in a queue, that bin's thread takes each number and inserts it into its own array.
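The queue-per-bin arrangement can be sketched roughly as follows. This is a Python stand-in, not the original C# code: the function name `sort_with_bins`, the bin count, and the use of plain sorted lists instead of the encoded byte arrays are all assumptions made to keep the example short.

```python
import bisect
import queue
import threading

def sort_with_bins(numbers, bin_count=4, max_value=99_999_999):
    """Distribute numbers to per-bin queues; one worker thread owns each bin."""
    width = max_value // bin_count + 1          # size of the value range per bin
    queues = [queue.Queue() for _ in range(bin_count)]
    bins = [[] for _ in range(bin_count)]       # stand-in for the encoded arrays

    def worker(index):
        while True:
            number = queues[index].get()
            if number is None:                  # sentinel: no more input
                break
            bisect.insort(bins[index], number)  # only this thread touches this bin

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(bin_count)]
    for thread in threads:
        thread.start()
    for number in numbers:                      # the distributing thread
        queues[number // width].put(number)
    for q in queues:
        q.put(None)
    for thread in threads:
        thread.join()
    return [number for b in bins for number in b]

print(sort_with_bins([5, 99_999_999, 42, 50_000_000, 42, 0]))
# [0, 5, 42, 42, 50000000, 99999999]
```

Because each bin covers its own value range, concatenating the bins in order gives the full sorted list without any locking between workers.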

Most small microcontrollers are in fact single-core (and have clock speeds much lower than a typical desktop CPU). That said, the problem does not state whether the memory restriction is due to running on an embedded system, or just a desire to optimize memory usage within a larger system.

That was my attempt at sarcasm : ) It clearly did not work. The multithreading portion was more of an attempt to figure out how threading worked in C#. Every project I work on I've been trying to incorporate more threading, just so I have more practice under my belt.

Heh. I've been known to miss sarcasm, especially in forum posts.

Good for you on the threading though. IMO proper use of multiple threads is an area where a lot of software developers are horribly deficient. Given that we seem to be hitting a wall on clock speed (and going to ever-increasing numbers of cores to compensate), multi-threaded apps are the future!

Stranger wrote:Problem: How do you sort and store 1 million 8 digit numbers into 1 MB of ram

Pretty easy actually. 1M 8 digit numbers is going to take up 1,000,000 bytes. You have 48,576 bytes left over to use for scratch space to implement your favorite sorting algorithm. An 8 digit number only takes up one byte.

Now, if the question specified an 8 digit number, base 10, then it's another problem altogether.

So I'm considering rewriting this project to improve performance, to make the data structure a tiny bit more compressed, and to make the insertions closer to O(log n) rather than O(n).

I'm considering replacing the bins with something a bit more elegant. I'll start with one giant array instead of a bunch of smaller ones, and instead of having bins which each contain a range of numbers (ex: 0-99999999), the array will be divided into partitions based on byte locations in the array. What this accomplishes is that if one bin were to overflow it can push numbers into the next higher bin. Each of these new bins would contain an int that represents the numerical value of the first byte in its partition; this would allow a binary search tree to be formed from the bins, which would allow near lg(n) insertions. Each bin would also contain a queue that would store insertion operations coming from the thread that dishes out the ints as well as from the adjacent bin.

I am also considering switching from 8 bit bytes to 7 bit bytes. Since the average spacing between integers is 99, most deltas should be representable in 7 bits (2^7 = 128). This would fit 8/7 as many entries into the same array, about 14% more. That gain would be offset by the fact that any spacing too large for 7 bits would now need a second entry to represent it. I wish I understood the math better so I could calculate the distribution of numbers and come up with an expected distribution of spacings. Unfortunately this optimization is probably going to have to be put on hold, since I have no idea how I can get a 7 bit primitive in C# in a graceful way.

Your differential encoding scheme is clever, but I think it fails if given adversarial input. For example, what if the given numbers are "1, 99999999, 1, 99999999, ...."? Since you'll need 4 bytes to encode each difference, the encoded set won't fit in 1 MB.

Stranger wrote:I wish I understood that math better so I could calculate the distribution of numbers and come up the an expected distribution of spacings.

Write a program that generates a million random numbers, sorts them, and then generates a histogram of the spacing distribution. Run it a couple of times.
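Something along these lines would do it. This is only a Python sketch with made-up names (`gap_statistics`, the bucket width of 32, the fixed seed), and it assumes the inputs are uniformly random, which the problem statement does not promise.

```python
import random
from collections import Counter

def gap_statistics(count=1_000_000, max_value=99_999_999, seed=0):
    """Generate random numbers, sort them, and summarize the gaps between neighbors."""
    rng = random.Random(seed)
    values = sorted(rng.randint(0, max_value) for _ in range(count))
    gaps = [b - a for a, b in zip(values, values[1:])]
    histogram = Counter(gap // 32 for gap in gaps)   # buckets of width 32
    over_128 = sum(gap > 128 for gap in gaps) / len(gaps)
    over_256 = sum(gap > 256 for gap in gaps) / len(gaps)
    return histogram, over_128, over_256

histogram, over_128, over_256 = gap_statistics()
for bucket in sorted(histogram)[:8]:
    print(f"{bucket * 32:4d}-{bucket * 32 + 31:4d}: {histogram[bucket]}")
print(f"gaps > 128: {over_128:.1%}, gaps > 256: {over_256:.1%}")
```

Changing `seed` and rerunning gives a feel for how much the distribution moves between runs.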

What is the worst case? If you had a sequence 0, 256, 512, ... your algorithm would store 1 number every two bytes. There is a limit to how long the sequence can last before you hit the upper limit, and the rest of the numbers then only need 1 byte each. But can you fit them into your 1MB array?

PrecambrianRabbit wrote:Your differential encoding scheme is clever, but I think it fails if given adversarial input. For example, what if the given numbers are "1, 99999999, 1, 99999999, ...."? Since you'll need 4 bytes to encode each difference, the encoded set won't fit in 1 MB.

You can sort them, so your numbers become 1,1,1,1,....,99999999, 99999999, 99999999, 99999999,

But the spacers only cover gaps of 256 as additions. You could have multiplicative spacers, but how do you distinguish between them?

Stranger wrote:Unfortunately this optimization is probably going to have to be put on hold since I have no idea of how I can get a 7bit primitive in C# in a graceful way.

Who cares. Use 8 bit ints to store 7 bit ints, and create an array 14% longer to work in. Prove the concept works, then worry about the bit twiddling later.

PrecambrianRabbit wrote:Your differential encoding scheme is clever, but I think it fails if given adversarial input. For example, what if the given numbers are "1, 99999999, 1, 99999999, ...."? Since you'll need 4 bytes to encode each difference, the encoded set won't fit in 1 MB.

You can sort them, so your numbers become 1,1,1,1,....,99999999, 99999999, 99999999, 99999999,

The question is "how to sort the numbers". If part of your solution is "sort the numbers", that seems to beg the question.

Storing an ordered list of numbers is much easier than storing an unordered list in the original order. Insertion sort is an obvious first choice if you don't have the list of input in RAM but are reading it from a stream. Possibly really expensive computationally, but you don't have much storage to waste on the sort. Anything you have to spare can be used for a merge sort. If you do have the ram space to sort it before storing it, then quick-sort the input. But I don't expect the order of the input numbers to break the storage scheme, just make it slower.

The point of this question is the sort, not the store. The answer to the storage part is "using disk". (Storing 1M 4-byte integers in 1MB is just not possible, in general. There's more entropy than bits to encode it.) Sorry, I realized that's only true if the array is unordered. The differential encoding scheme above applied to a sorted list would use less than 1 MB because the sortedness of the list removes entropy. Still, I maintain that the sort is the crux of this problem, since if you're given an unordered list you have to produce the sorted list, and that's not easy when you're memory constrained.

For the sort, radix sort is a good algorithm for sorting integers with a known range -- now, how would you make radix sort work when you can't fit the whole data set in memory at once? (I think the term for this type of sort is a distribution sort.)

OK, so I guess the storing is actually interesting after all, based on my edit above. The encoding I would use is a variable length integer, encoded as a series of bytes where the topmost bit indicates a stop bit. I think that should allow you to encode each offset using an average of 8 bits per offset, which will fit in the memory requirements.
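For reference, a stop-bit integer of that kind could be sketched like this in Python. The byte order (low 7 bits first) and the helper names are my own choices for the example; the post only says that the topmost bit marks the end of a number.

```python
def encode_varint(value):
    """Emit 7 bits per byte, low bits first; the top bit is set only on the last byte."""
    out = bytearray()
    while value >= 0x80:
        out.append(value & 0x7F)
        value >>= 7
    out.append(value | 0x80)  # stop bit
    return bytes(out)

def decode_varints(data):
    """Decode a concatenation of stop-bit integers back into a list."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:       # stop bit: this number is complete
            values.append(current)
            current, shift = 0, 0
        else:
            shift += 7
    return values

offsets = [5, 100, 300, 99_999_999]
packed = b"".join(encode_varint(offset) for offset in offsets)
print(len(packed))             # 8 bytes: 1 + 1 + 2 + 4
print(decode_varints(packed))  # [5, 100, 300, 99999999]
```

Any offset below 128 costs one byte and anything up to 16,383 costs two, so the average size depends on how many offsets land above 127.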

JBI: Just saw your post. You're totally right. I think I was focused on the sort because it seemed like the classic external sort problem, and if you're doing an external sort then you have a disk, so storing isn't a problem. I should've remembered what happens when I assume.

PrecambrianRabbit wrote:Your differential encoding scheme is clever, but I think it fails if given adversarial input. For example, what if the given numbers are "1, 99999999, 1, 99999999, ...."? Since you'll need 4 bytes to encode each difference, the encoded set won't fit in 1 MB.

In my implementation I sort the numbers as they come in, but that still leaves the problem of adversarial input. From my understanding it is extremely difficult to develop a single compression algorithm that doesn't have corner cases. I'm pretty sure my algorithm's biggest weakness would be if every single number was the max value. Then I would need a bunch of bytes as spacers up to 99,999,999 and then a million bytes for the numbers. That would be....

99,999,999/255 + 1,000,000 = 1,392,156.9 bytes or ~1.39 megabytes

I did some reading on the math involved (a bunch of people have written about solving this problem now, it seems). So there are

MAXVALUE Multichoose NumOfNumbers = 10^(2.44*10^6)

possible sorted lists, which is a huge number, but we can express all those different possibilities with a single number that is...

LG2( 10^(2.44*10^6) ) = about 8.09 million bits = 0.96484MB

But for us to encode the data like this we would have to have a lookup table for all 10^(2.44*10^6) possible combinations. This table is probably not going to be able to fit in 1 MB of RAM. While this calculation doesn't give us an easy way to solve this problem, it does suggest that this is a solvable problem.
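The figure can be checked without ever forming the huge number by working with logarithms. The sketch below is Python with a made-up helper name; it uses the standard identity that n multichoose k equals C(n + k - 1, k), and evaluates the log of that binomial with `math.lgamma`.

```python
import math

def multichoose_bits(possible_values, count):
    """log2 of C(possible_values + count - 1, count): bits needed to name one multiset."""
    natural_log = (math.lgamma(possible_values + count)
                   - math.lgamma(count + 1)
                   - math.lgamma(possible_values))
    return natural_log / math.log(2)

bits = multichoose_bits(10**8, 10**6)
print(f"{bits:,.0f} bits")                          # a little over 8.09 million
print(f"10^{bits * math.log10(2):,.0f} combinations")
print(f"{bits / 8 / 2**20:.5f} of 1 MB (2^20 bytes)")
```

So any encoding, however clever, needs at least this many bits in the worst case, which leaves only a few percent of the 1 MB for everything else.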

sschaem wrote:
1) make 255 the escape code, followed by the non delta value
2) since the problem is to 'store' in 1MB not process, the runtime can use any representation. you only need a fast packing function.

That's not a bad idea. A slight tweak might be to allow two or three 255 values and then use a 32 bit int to represent the next value. This would get rid of a bunch of nasty corner cases.

Ari Atari wrote:Hey, just ran your code on a Phenom II X6 @ 4GHz and this is what I got:
Time elapsed: 00:00:23.9129309
Anyway, good luck!

PrecambrianRabbit wrote:You're totally right. I think I was focused on the sort because it seemed like the classic external sort problem, and if you're doing an external sort then you have a disk, so storing isn't a problem. I should've remembered what happens when I assume.

I was actually reading a couple of other solutions people proposed, and one of the more ingenious ones was using the ping file system to store the numbers. Essentially the system works by storing data inside pings sent out across the globe.

Really interesting. I'm not sure I get the math - how does one arrive at the 10^(2.44*10^6) value?

I think I've got a scheme that will give a worst case of 2 MB... My idea is to use a run of zeros as a unary encoding of how many bits will be used to represent the following number, and then encode the number using just those bits. So encoding 5 would be: 000101, and encoding 128 would be 0000000010000000. I think the worst case for this algorithm is a whole run of differences-by-128, which uses 2 MB to store. If the differences are bigger than 128 you "eat up" the state space more quickly, so you'll need smaller numbers at the end.

Of course, this still isn't good enough. I'm thinking that the trick is to focus on a really tight encoding for numbers around 100, and try not to be too inefficient for the rest of the space. I'm starting to feel like my information theory background is woefully inadequate for this problem.
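Here is a small Python sketch of that zero-run scheme, using strings of "0" and "1" instead of packed bits to keep it readable. The function names are made up, and the restriction to values of 1 or more is a limitation of this sketch (a value of 0 has no leading 1 bit to mark where the zero run ends).

```python
def encode_number(n):
    """Write len(bits) zeros, then the binary digits of n (which start with a 1)."""
    if n < 1:
        raise ValueError("this sketch only handles n >= 1")
    bits = format(n, "b")
    return "0" * len(bits) + bits

def decode_stream(stream):
    """Count zeros to learn the width, then read that many bits as the number."""
    values, i = [], 0
    while i < len(stream):
        width = 0
        while stream[i] == "0":
            width += 1
            i += 1
        values.append(int(stream[i:i + width], 2))
        i += width
    return values

print(encode_number(5))    # 000101
print(encode_number(128))  # 0000000010000000
print(decode_stream(encode_number(5) + encode_number(128)))  # [5, 128]
```

A difference of 128 costs 16 bits, so a million of them would take 2,000,000 bytes, which is the 2 MB worst case mentioned above.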

just brew it! wrote:Not sure why you think the point is the sorting (not the storage). To me, the statement of the problem seems to imply both.

It's left to interpretation. Since you can't have the data fit in 1MB unsorted, it must reside somewhere. So I have to assume the data is stored in its native format, and it's 'easy' to sort in place using any of many algorithms, like quicksort. So you only need to write a clever storage function.

If Google said "You are given 1 million values, sequentially, over time, and need to store them sorted in a 1MB buffer" then yeah, I would write an insertion function. Also the question didn't say "the fastest way to ..." so I wouldn't spend time optimizing.

My guess is that you get the 'bonus' point by proving mathematically that your solution works in all cases.

Maybe I'm missing the point, but can you make an array of bytes and use the address (index) to encode the number? You can then write 1s or 0s to that location if that number is found. So for instance, if we see the number 576, we make array[576]=1. If we have to handle duplicates in the input then we can just increment the number (and we can handle up to 255 duplicates without problems...). You can always subtract off the address of the first element if we cannot assume the first element starts at memory location 0. Insertion is O(1), and reading the sorted list is O(N). The program which does any computations on this will be messy (depending on what we want to do), but in terms of storage it is efficient.

Turkina wrote:Maybe I'm missing the point, but can you make an array of bytes and use the address (index) to encode the number? You can then write 1s or 0s to that location if that number is found. So for instance, if we see the number 576, we make array[576]=1. If we have to handle duplicates in the input then we can just increment the number (and we can handle up to 255 duplicates without problems...). You can always subtract off the address of the first element if we cannot assume the first element starts at memory location 0. Insertion is O(1), and reading the sorted list is O(N). The program which does any computations on this will be messy (depending on what we want to do), but in terms of storage it is efficient.

You don't have enough locations in a 1 MB array to represent all possible 8 digit numbers. Yes, you only have 1 million numbers, but they can be 8 digits -- which specific 8 digit number does each array location correspond to? If you've already used up most of the memory for your counters you don't have enough memory left to map the array locations to the 8-digit numbers they are supposed to represent. You also can't represent more than 255 duplicates of any single value.

Seems pretty clear to me that some form of delta encoding (as others have suggested) is the way to go.

Turkina wrote:Maybe I'm missing the point, but can you make an array of bytes and use the address (index) to encode the number? You can then write 1s or 0s to that location if that number is found. So for instance, if we see the number 576, we make array[576]=1.

Unfortunately the range of numbers makes this type of encoding infeasible. If we used 8 bits per number we would need....

8 bits * 99 999 999 = 95.3674307 megabytes

Even if we used 1 bit per entry...

1 bit * 99 999 999 = 11.9209288 megabytes

Going the other way, a 1 MB array with 8 bit entries only has room for....

1MB / 8 bits = 1,048,576 entries

(or roughly enough to index a 6 digit number)
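The same arithmetic as a few lines of Python, in case anyone wants to try other entry sizes. The helper name is made up, and "megabyte" here means 2^20 bytes to match the figures above.

```python
MB = 2**20  # bytes per megabyte, matching the figures in this post

def table_size_mb(possible_values, bits_per_entry):
    """Size of a direct-indexed table with one entry per possible value."""
    return possible_values * bits_per_entry / 8 / MB

print(round(table_size_mb(99_999_999, 8), 4))  # 95.3674
print(round(table_size_mb(99_999_999, 1), 4))  # 11.9209
print(MB * 8 // 8)                             # 1048576 one-byte entries fit in 1 MB
```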

PrecambrianRabbit wrote:Really interesting. I'm not sure I get the math - how does one arrive at the 10^(2.44*10^6) value?

Multichoose is a derivative of choose from probability. It is, in essence, a nice convenient formula for solving problems like this.

To be honest the math is a bit beyond my pay grade, but using it is not too difficult: just plug in the number of different possible values and the number of numbers, and you get the total number of different possible combinations.

PrecambrianRabbit wrote:I think I've got a scheme that will give a worst case of 2 MB... My idea is to use a run of zeros as a unary encoding of how many bits will be used to represent the following number, and then encode the number using just those bits. So encoding 5 would be: 000101, and encoding 128 would be 0000000010000000. I think the worst case for this algorithm is a whole run of differences-by-128, which uses 2 MB to store. If the differences are bigger than 128 you "eat up" the state space more quickly, so you'll need smaller numbers at the end.

Of course, this still isn't good enough. I'm thinking that the trick is to focus on a really tight encoding for numbers around 100, and try not to be too inefficient for the rest of the space. I'm starting to feel like my information theory background is woefully inadequate for this problem.

Another slight derivative of your idea could be to use the 254 value as the spacer byte and the 255 value to indicate that the next 32 bits will be an int with the distance to the next number. This would greatly reduce the worst case situation.

Also, I was thinking a bit more and figured out the most obvious optimization, which I totally should have thought of earlier. The idea is to use a small buffer where you sort incoming ints. Once the buffer is full and sorted, the buffer is essentially merge sorted into the encoded data. This skips having to O(n) insert every number into the array.

PrecambrianRabbit wrote:Really interesting. I'm not sure I get the math - how does one arrive at the 10^(2.44*10^6) value?

Multichoose is a derivative of choose from probability. It is, in essence, a nice convenient formula for solving problems like this.

Oh, gotcha! I was thrown off by the right-hand side, since for multichoose I expected factorial notation, but now I follow. Interesting stuff. I also don't see how to encode the permutation as an integer directly, though, unfortunately.

Stranger wrote:Another slight derivative of your idea could be to use the 254 value as the spacer byte and the 255 value to indicate that the next 32 bits will be an int with the distance to the next number. This would greatly reduce the worst case situation.

Yep, that definitely helps. Getting under 1 MB is a real trick though. I've been thinking about this while simulations run in the background at work, and started wondering if some sort of compressed data structure like a prefix tree would be useful. Although, I think the answer to that is probably "no".

PrecambrianRabbit wrote:My idea is to use a run of zeros as a unary encoding of how many bits will be used to represent the following number, and then encode the number using just those bits. So encoding 5 would be: 000101, and encoding 128 would be 0000000010000000. I think the worst case for this algorithm is a whole run of differences-by-128, which uses 2 MB to store. If the differences are bigger than 128 you "eat up" the state space more quickly, so you'll need smaller numbers at the end.

Nice. But how do you store identical numbers? - I guess you can use 1 as a repeat code, so 00101110010 would be an increase of 2, repeated a further 3 times, then another increase of 2? But still a factor of 2 out on the memory budget.

I would also suggest cheating, and having more than one storage structure. It takes more logic to walk all the structures in order, but would allow you to pull any hard edge cases out. Like having a list of repeated numbers, or looking for sequences that you can encode in less space as a begin, difference and end.

Alright, I'm going to rewrite my code. I'm going to do essentially the same as above, but this time I'm going to incorporate a sorted prebuffer, and instead of a single flat array I'll use a circular buffer.

The new merge function will work as follows. First the prebuffer is sorted. Next, a function starts to decode and delete whatever data is currently in the array. The encoder selects the next lowest value to encode, whether from the prebuffer or from the old values in the array. A new value is encoded and stored at the next available spot in the circular buffer. This continues until both the old values and the prebuffer have been drained, reencoded and stored. The whole process is repeated until all values are stored in the circular buffer.

Previously every single new int would require an entire reencoding of the stored values, which is a demanding task. The use of the prebuffer and the circular buffer reduces the number of needed reencodings to NUMBEROFINTS/SIZEOFPREBUFFER. There are several additional performance benefits to this. In my previous attempt at this problem a significant amount of time was spent messing around with the back of the arrays, where there were no values. This method only encodes as many bytes as it needs.

|| {old encoding}---------{new encoding}-----||
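The merge step can be sketched in a few lines of Python. To keep it short this version writes into a fresh bytearray instead of a circular buffer, and it uses the simple 255-as-spacer byte encoding from earlier in the thread; both of those, along with the function names, are simplifications for the example rather than what the real code does.

```python
import heapq

SPACER = 255  # assumed: a 255 byte advances the running total without emitting a value

def decode(data):
    """Lazily yield the stored values, one at a time, in ascending order."""
    current = 0
    for byte in data:
        current += byte
        if byte != SPACER:
            yield current

def encode(values):
    """Encode an ascending stream of values as spacer-padded one-byte gaps."""
    out = bytearray()
    previous = 0
    for value in values:
        gap = value - previous
        out += bytes([SPACER]) * (gap // SPACER)
        out.append(gap % SPACER)
        previous = value
    return out

def merge_prebuffer(encoded, prebuffer):
    """Sort the prebuffer, then merge it with the old data in one re-encoding pass."""
    return encode(heapq.merge(decode(encoded), sorted(prebuffer)))

stored = encode([10, 500, 1000])
stored = merge_prebuffer(stored, [700, 3, 500])
print(list(decode(stored)))  # [3, 10, 500, 500, 700, 1000]
```

Since `decode` and `heapq.merge` are both lazy, only one old value and one prebuffer value need to be decoded at any moment, which is what makes a single pass per prebuffer possible.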

Additionally, I'm going to modify my encoding slightly. Data is still going to be stored in bytes, so that most reads are aligned (I think), but I'm going to have an additional special value.

now

1-253 = value of the difference between the two numbers in the array (plus 254 per previous spacer byte)
254 = spacer byte
255 = next 32 bits are an int that represents the value of the next delta. This reduces the worst case to just 40 bits

To determine the optimal point at which to switch from the spacer byte encoding to the 8 bit indicator + 32 bit int encoding, we first need to solve for the case where the two encodings use the same number of bits:

40bits = (delta/254)*8bits + 8bits

delta = 1016

So any delta less than 1016 will be encoded with spacer bytes, while any delta of 1016 or more will be encoded with a 32 bit integer.
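A Python sketch of an encoder that switches at that point might look like this. Treating 0 as a valid literal (so duplicates can be stored), using little-endian byte order for the 32 bit int, and the function names are all assumptions made for the example.

```python
import struct

SPACER, ESCAPE = 254, 255
SWITCH_POINT = 1016  # from 40 bits = (delta / 254) * 8 bits + 8 bits

def encode_delta(delta):
    """Use spacer bytes for small deltas and the 255 escape + 32 bit int for large ones."""
    if delta >= SWITCH_POINT:
        return bytes([ESCAPE]) + struct.pack("<I", delta)
    return bytes([SPACER]) * (delta // 254) + bytes([delta % 254])

def decode_deltas(data):
    """Turn the byte stream back into the list of deltas."""
    deltas, pending, i = [], 0, 0
    while i < len(data):
        byte = data[i]
        if byte == SPACER:
            pending += 254
            i += 1
        elif byte == ESCAPE:
            deltas.append(struct.unpack_from("<I", data, i + 1)[0])
            i += 5
        else:
            deltas.append(pending + byte)
            pending = 0
            i += 1
    return deltas

for delta in (100, 300, 1015, 1016, 99_999_999):
    print(delta, len(encode_delta(delta)) * 8, "bits")
```

At a delta of 1016 both forms cost 40 bits, so which side of the tie uses the escape makes no difference to the size.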

I also went and just plugged in the distribution into a graphing program and got this.

According to this distribution there are...

79726 of a million greater than 256 (7.9%)
279781 of a million greater than 128 (27.9%)

I think this puts to rest the idea of using 7 bit numbers for the encoding. It seems like an 8 bit encoding would catch 92.1% of all deltas.

PrecambrianRabbit wrote:I think I've got a scheme that will give a worst case of 2 MB... My idea is to use a run of zeros as a unary encoding of how many bits will be used to represent the following number, and then encode the number using just those bits. So encoding 5 would be: 000101, and encoding 128 would be 0000000010000000. I think the worst case for this algorithm is a whole run of differences-by-128, which uses 2 MB to store. If the differences are bigger than 128 you "eat up" the state space more quickly, so you'll need smaller numbers at the end.

Of course, this still isn't good enough. I'm thinking that the trick is to focus on a really tight encoding for numbers around 100, and try not to be too inefficient for the rest of the space. I'm starting to feel like my information theory background is woefully inadequate for this problem.

I will say that you are dangerously close to another implementation that I read about that will get you under the 1 MB mark.

The key is to use those prebits as a switch like you were doing, i.e. a zero bit is used like a spacer byte in my implementation. Once a 1 bit is read in, another 7 bits are read in as the delta, which is added to the value specified by the number of spacer bits.

I've come up with a slight derivative of that idea that catches worst cases much more gracefully

The prebits are used to indicate the number of bits to read in i.e.

1 = read 7 bits
01 = read 9 bits
001 = read 12 bits
0001 = read 32 bits

This greatly reduces the worst case situation, since it's an exponential type of system rather than an additive system like the one above. For example...

In the scheme above, 99,999,999 would require 781,249 0s followed by 7 bits. In my encoding all that would be required is the 0001 precode followed by a 32 bit int. This should get rid of almost all of the worst case situations.
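As a rough Python sketch of that precode table (bit strings instead of packed bits, and made-up function names), each delta picks the first class wide enough to hold it:

```python
# (precode, payload width in bits), taken from the table in this post
CLASSES = [("1", 7), ("01", 9), ("001", 12), ("0001", 32)]

def encode_delta(delta):
    """Pick the first class wide enough for delta and emit precode + fixed-width payload."""
    for precode, width in CLASSES:
        if delta < (1 << width):
            return precode + format(delta, "0{}b".format(width))
    raise ValueError("delta does not fit in 32 bits")

def decode_stream(bits):
    """Count the zeros before the first 1 to find the class, then read its payload."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        i += 1                      # skip the 1 that ends the precode
        width = CLASSES[zeros][1]
        values.append(int(bits[i:i + width], 2))
        i += width
    return values

for delta in (99, 300, 4000, 99_999_999):
    print(delta, len(encode_delta(delta)), "bits")  # 8, 11, 15 and 36 bits
```

A typical delta around 99 still costs 8 bits, while the largest possible delta costs 36 bits instead of hundreds of thousands of spacer bits.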

New update. Now that I've rewritten the whole thing, I can encode 1 million ints into 1MB in about 5.49 seconds on one 1.7GHz Bobcat core. So with 1/4th the cores and 1/2 the processor speed it's 9x as fast (roughly a 72X speedup). Plus, with the new encoding scheme it should be much more resilient to adverse inputs.