Array Duplicates

September 23, 2011

There are essentially two solutions: sorting and searching. In the sorting solution, the vector is first sorted, then its items are scanned sequentially, stopping when an item is equal to its predecessor. In the searching solution, each item in the vector is inserted into an auxiliary data structure, stopping when the item being inserted is found to be already present.

The sorting solution takes time O(n log n) for the sort plus time O(n) for the scan. (The sort itself could be reduced to O(n) if the items are integers, as in the problem statement, for instance with a counting or radix sort, but not if the items are some other data type.) It requires only as much auxiliary space as the sort needs for its recursion stack, which is O(log n) assuming quicksort; it also has the side effect of scrambling the original order of the vector, unless you make a copy first. The searching solution requires time O(n) to run through the items in the vector, assuming constant-time insertion and lookup as with a hash table, but requires O(n) space to store the auxiliary data structure. Here is our version of the sorting solution, which calls the Bentley/McIlroy vector sort from the Standard Prelude:
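
The Scheme listing from the original post is not included in this text. As a rough illustration only, the sorting and searching approaches can be sketched in Python; the function names `duplicate_by_sorting` and `duplicate_by_searching` are hypothetical, and this is not the Standard Prelude code. Note that `sorted` returns a copy, so this sketch spends O(n) extra space in exchange for leaving the input order intact.

```python
def duplicate_by_sorting(items):
    """Sort a copy, then scan for an item equal to its predecessor."""
    ordered = sorted(items)  # O(n log n); copies, so the input is not scrambled
    for prev, cur in zip(ordered, ordered[1:]):
        if prev == cur:
            return cur
    return None  # no duplicate found


def duplicate_by_searching(items):
    """Insert items into a set, stopping at the first one already present."""
    seen = set()  # auxiliary structure, O(n) space
    for item in items:
        if item in seen:
            return item
        seen.add(item)
    return None  # no duplicate found
```

Both functions return `None` when the input has no repeated item, which is a choice made for this sketch rather than something the problem statement specifies.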

The problem statement is not clear that all the integers from one to a million appear in the input vector, though in many variants of the problem that is true; you may wish to ask the interviewer to clarify. If it is true, there is another solution based on Gauss' formula: sum the integers in the array and subtract the sum of the first n integers, n(n+1)/2, so that the difference is the duplicated item.
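
A minimal Python sketch of the summation approach, under the assumption that the array holds every integer from 1 to n exactly once plus a single extra copy of one of them (so it has n + 1 items). The function name `duplicate_by_sum` is hypothetical.

```python
def duplicate_by_sum(items):
    """Return the duplicated value, assuming items is 1..n plus one repeat."""
    n = len(items) - 1              # n distinct values plus one duplicate
    expected = n * (n + 1) // 2     # Gauss' formula for 1 + 2 + ... + n
    return sum(items) - expected    # the surplus is the duplicated item
```

For example, `duplicate_by_sum([3, 1, 2, 3])` computes 9 - 6 and returns 3. If the assumption does not hold, the result is meaningless, so the input guarantee matters here.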

There are doubtless other solutions, and variants on the problem; one variant is to find a missing rather than a duplicated item. But we’ve given the basic solutions, so we’ll stop there. You can run the program at http://programmingpraxis.codepad.org/sGxhES1h. If you look at codepad, you’ll see that the sorting solution is much slower than the other two. That’s because our sort is relatively slow. Replacing our sort with the system sort drastically improves the time, though it is still not as fast as the other two methods.

22 Responses to “Array Duplicates”

This sentence is confusing me, do you mean the sort can be reduced to O(n)?
> The sorting solution takes time O(n log n) for the sort plus time O(n) for the search (that could be reduced to O(n) if the items are integers

uh, it’s known that the sum of x from 1 to n is a formula: (n(n+1))/2, right? So, take the sum of your entire input array, subtract off the calculated constant, and you’re left with the duplicate value. No sorting or extra arrays necessary. :-)

@Mike: The integers in the array are not guaranteed to be from 1 to 1000000. They could be any value between 1 and 1000000, and you could have an arbitrary number of array elements. Your method would work otherwise :)

The logic is a mere two lines in Racket. Unfortunately, this assumes that no other integer appears more than once in the array, and that there is enough space in memory to hold twice the number of required integers. On the other hand, with Racket’s tail-call optimization, this code runs in linear time (set operations are constant-time hashing calculations).

My mistake, I meant constant space, not constant time. The recursive calls can’t overflow the stack because of Racket’s use of continuation passing style to avoid ever returning from the tail position. Neat stuff!

Another solution is to sort the list and use a binary search to find the duplicate value. The insight is that the index of an element in the sorted list is either equal to the element or one greater than the element.

Moving the indexes is a little different than vanilla binary search though:
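
The commenter's code is not included in this text. As a hedged Python sketch of the idea, assume the list contains every integer from 1 to n plus one duplicate. With 0-based indexing, elements before the second copy of the duplicate satisfy `a[i] == i + 1`, and elements from that copy onward satisfy `a[i] == i`, so the search looks for the first index of the second kind. The function name `duplicate_by_binary_search` is hypothetical.

```python
def duplicate_by_binary_search(items):
    """Find the duplicate in 1..n plus one repeat via sort and binary search."""
    a = sorted(items)
    lo, hi = 0, len(a) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] > mid:
            # Still in the prefix where a[i] == i + 1: duplicate lies to the right.
            lo = mid + 1
        else:
            # a[mid] == mid: at or past the second copy, so keep mid in range.
            hi = mid
    return a[lo]
```

The sort still costs O(n log n), so the binary search only speeds up the scan that follows it.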