We could split this up into files. We’ll have to deal with the file loading / unloading—ugh.

We could hash to disk. Size wouldn’t be a problem, but access time might. A hash table on disk would require a random-access read for each check and a write to store each viewed URL. Each of these could take milliseconds waiting for seek and rotational latencies. Elevator algorithms could eliminate random bouncing from track to track.

Or, we could split this up across machines, and deal with network latency. Let’s go with this solution, and assume we have n machines.

You have an array with all the numbers from 1 to N, where N is at most 32,000. The array may have duplicate entries and you do not know what N is. With only 4KB of memory available, how would you print all duplicate elements in the array?

My initial thoughts:
4KB = 32768 bits. We can use a bit array in which bit i records whether we have seen number i; once we see it, we set the bit to 1. If we encounter a number whose bit is already 1, we know it is a duplicate, so we print it out.

Solution:
We have 4KB of memory which means we can address up to 8 * 4 * (2^10) bits. Note that 32*(2^10) bits is greater than 32000. We can create a bit vector with 32000 bits, where each bit represents one integer.
NOTE: While this isn’t an especially difficult problem, it’s important to implement this cleanly. We will define our own bit vector class to hold a large bit vector.
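As a minimal sketch of the bit vector approach described above (the class and method names BitVector, DuplicateFinder and reportDuplicates are illustrative assumptions, not taken from the original solution), the idea might look like this in Java. The vector stores 32000 bits in 1000 ints, which is 4000 bytes and so stays under the 4KB limit. Duplicates are handed to a callback so the caller can print them without storing them.

```java
import java.util.function.IntConsumer;

/** Minimal fixed-size bit vector backed by an int array (32 bits per word). */
final class BitVector {
    private final int[] words;

    BitVector(int size) {
        // Round up so that `size` bits always fit.
        words = new int[(size + 31) >>> 5];
    }

    boolean get(int pos) {
        return (words[pos >>> 5] & (1 << (pos & 31))) != 0;
    }

    void set(int pos) {
        words[pos >>> 5] |= 1 << (pos & 31);
    }
}

final class DuplicateFinder {
    // Upper bound on N from the problem statement.
    static final int MAX_VALUE = 32000;

    /**
     * Calls onDuplicate once for every repeated occurrence, in encounter order.
     * A value that appears three times is reported twice.
     */
    static void reportDuplicates(int[] array, IntConsumer onDuplicate) {
        BitVector seen = new BitVector(MAX_VALUE);
        for (int value : array) {
            int index = value - 1; // values start at 1, bit positions at 0
            if (seen.get(index)) {
                onDuplicate.accept(value);
            } else {
                seen.set(index);
            }
        }
    }
}
```

To print the duplicates, one could call DuplicateFinder.reportDuplicates(array, System.out::println).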

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory.
FOLLOW UP
What if you have only 10 MB of memory?

My initial thoughts:
In Java, an integer is 32 bits. 4 billion integers would eat up roughly 15 GB of memory, so we cannot hold them all in main memory. We can instead use a boolean array in which the cell at position i represents whether number i is present in the file. Then, to generate a new number, we just search the array for a cell whose value is false, meaning that number is absent from the file. However, in Java, the maximum length of an array is around 2^31 (array indices are ints), so we can use a 2D array with indexing to cover the whole range.

There are a total of 2^32, or 4 billion, distinct integers possible. We have 1 GB of memory, or 8 billion bits.
Thus, with 8 billion bits, we can map all possible integers to a distinct bit with the available memory. The logic is as follows:
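One way to sketch this bit-per-integer mapping in Java is shown below. This is a hedged illustration, not the original solution: the names MissingIntFinder and findMissing are hypothetical, an IntStream stands in for reading the file, and only values in [0, rangeSize) are tracked. With rangeSize = 2^31 (all non-negative ints) the byte array takes 2^28 bytes, i.e. 256 MB; covering negative values as well would need an offset and a second array.

```java
import java.util.stream.IntStream;

final class MissingIntFinder {
    /**
     * Returns the smallest value in [0, rangeSize) that does not occur in
     * `values`, or -1 if every value in that range is present.
     */
    static long findMissing(IntStream values, long rangeSize) {
        // One bit per candidate value, packed eight to a byte.
        byte[] bits = new byte[(int) ((rangeSize + 7) / 8)];

        // Scan the input once and mark every value we see.
        values.forEach(v -> {
            if (v >= 0 && v < rangeSize) {
                bits[v >>> 3] |= (byte) (1 << (v & 7));
            }
        });

        // Scan the bit vector for the first bit that is still 0.
        for (long i = 0; i < rangeSize; i++) {
            int mask = 1 << (int) (i & 7);
            if ((bits[(int) (i >>> 3)] & mask) == 0) {
                return i;
            }
        }
        return -1;
    }
}
```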

With only 10 MB of memory, it’s still possible to find a missing integer with just two passes of the data set. We can divide up the integers into blocks of some size (we’ll discuss how to decide on a size later). Let’s just assume that we divide up the integers into blocks of 1000. So, block 0 represents the numbers 0 through 999, block 1 represents the numbers 1000 through 1999, etc. Since the range of ints is finite, we know that the number of blocks needed is finite.
In the first pass, we count how many ints are in each block. That is, if we see 552, we know that it is in block 0, so we increment counter[0]. If we see 1425, we know that it is in block 1, so we increment counter[1].
At the end of the first pass, we’ll be able to quickly spot a block that is missing a number. If our block size is 1000, then any block which has fewer than 1000 numbers must be missing a number. Pick any one of those blocks.
In the second pass, we’ll actually look for which number is missing. We can do this by creating a simple bit vector of size 1000. We iterate through the file, and for each number that should be in our block, we set the appropriate bit in the bit vector. By the end, we’ll know which number (or numbers) is missing.
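The two passes above can be sketched in Java roughly as follows. This is an illustration under assumptions: the names TwoPassMissingFinder and findMissing are hypothetical, a Supplier of IntStream stands in for reopening the file for each pass, and only values in [0, rangeSize) are considered. Note that a block whose count reaches the block size because of duplicates is skipped even if it has a gap, so -1 means no block was provably short, not that nothing is missing.

```java
import java.util.function.Supplier;
import java.util.stream.IntStream;

final class TwoPassMissingFinder {
    /**
     * Returns a value in [0, rangeSize) that is absent from the data,
     * or -1 if no block has a count below its capacity.
     */
    static long findMissing(Supplier<IntStream> source, long rangeSize, int blockSize) {
        int blockCount = (int) ((rangeSize + blockSize - 1) / blockSize);
        int[] counts = new int[blockCount];

        // Pass 1: count how many values fall into each block.
        source.get().forEach(v -> {
            if (v >= 0 && v < rangeSize) {
                counts[v / blockSize]++;
            }
        });

        // Pick the first block that holds fewer values than it has slots.
        int sparse = -1;
        for (int b = 0; b < blockCount; b++) {
            long blockStart = (long) b * blockSize;
            long capacity = Math.min(blockSize, rangeSize - blockStart);
            if (counts[b] < capacity) {
                sparse = b;
                break;
            }
        }
        if (sparse < 0) {
            return -1;
        }

        // Pass 2: mark which values of the chosen block actually occur.
        long start = (long) sparse * blockSize;
        long end = Math.min(start + blockSize, rangeSize);
        byte[] bits = new byte[(blockSize + 7) / 8];
        source.get().forEach(v -> {
            if (v >= start && v < end) {
                int offset = (int) (v - start);
                bits[offset >>> 3] |= (byte) (1 << (offset & 7));
            }
        });

        // The first unset bit in the block is a missing value.
        for (long i = start; i < end; i++) {
            int offset = (int) (i - start);
            if ((bits[offset >>> 3] & (1 << (offset & 7))) == 0) {
                return i;
            }
        }
        return -1; // not reached when the block count was below capacity
    }
}
```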
Now we just have to decide what the block size is.
A quick answer is 2^20 values per block. We will need an array with 2^32 / 2^20 = 2^12 block counters (2^14 bytes, assuming 4-byte int counters) and a bit vector of 2^20 bits, which is 2^17 bytes. Both of these can comfortably fit in 10*2^20 bytes.
What’s the smallest footprint? When the array of block counters occupies the same memory as the bit vector. Let N = 2^32.