Given a vector of size $n$ with integer entries in the range $[1, n-1]$, find $if$ and $which$ numbers are multiple duplicates (numbers appearing more than once). The time complexity should be linear and the space complexity constant (no auxiliary data structures, only the input vector and a fixed number of variables), and the input data must not be modified.

I'm trying to solve this problem for at least one multiple duplicate (as a starting point, although I don't know if the solution to this case can be expanded to the general case of at least two multiple duplicates). Here's what I've done so far:

The $if$ part is relatively easy: by the pigeonhole principle there $will$ be at least one duplicate in the given data set, since $n$ entries can take at most $n-1$ distinct values.

The $which$ part, on the other hand, is what gives me headaches. Unlike the missing number problem, we cannot simply XOR the numbers. Nor can we sum them up and subtract the sum of $1$ to $n-1$, which would only find a single duplicate that appears exactly twice. Can anyone suggest a good approach to this problem? You are not allowed to modify the input array.

You know all the numbers will be in the range $[1, n-1]$. If you come across a value $1 \leq A[i] \leq n-1$, try to think of a way that you could "mark" it using the space you already have. By "mark" it, I mean, make a note somehow that the value $A[i]$ has already been seen. Then if we come across the same value again, say $A[j] = A[i]$, we can check if the value $A[j]$ has been marked to see if it is a duplicate. So you need to figure out a way to "mark" a value in-place ($O(1)$ space). The key here is to note that $A[i]$ will be in the range $[1, n-1]$ (use this to your advantage!)
– ryan, Nov 2 '17 at 21:12

@ryan Since we know the search space is $[1, n-1]$, I suppose we can "mark" a value by changing it to anything over $n$ (some kind of mapping by modifying the array?). Although it seems to me that your suggestion requires looking "back" in the array.
– theSongbird, Nov 2 '17 at 21:28

If we know $1 \leq A[i] \leq n-1$, what if we could "mark" it by somehow utilizing the location in the array that $A[i]$ indexes to? For example, we could put some extra info in the value at $A[A[i]]$ to say we've already seen $A[i]$. You would need to make sure not to mess up the true value at $A[A[i]]$ (or at least be able to recover it somehow).
– ryan, Nov 2 '17 at 21:40

@ryan Well, we could modify the sign bit (given we have signed data; if not, modify the "virtual" location of the sign bit), and this won't consume extra space. How does this sound? (Although by modifying sign bits we can't really tell whether we can use this to count duplicates.)
– theSongbird, Nov 2 '17 at 21:50

Why couldn't we use this to count duplicates? Let's say we have $A = \{3, 3, 1, 2, 4\}$. We encounter $3$ so set $A[3] = 0 - A[3]$, we get $A = \{3, 3, -1, 2, 4\}$. We encounter the next $3$, so we first check the value of $A[3]$; if it's negative we know it's a duplicate. If it's positive then we set $A[3]$ to negative and continue.
– ryan, Nov 2 '17 at 21:54
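
For reference, the sign-marking idea from the comments above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `find_duplicates_by_sign_marking` is made up here, positions are 0-based (so value $v$ is marked at index $v-1$), and the array is modified temporarily and restored at the end, which does not meet the question's requirement that the input stay untouched. It effectively borrows one bit per entry from the input, and the returned set is treated as output rather than working space.

```python
def find_duplicates_by_sign_marking(a):
    """Return the set of values that occur more than once in `a`.

    Assumes every entry is an integer in [1, len(a) - 1].
    The list is negated in places while scanning and restored before returning.
    """
    duplicates = set()  # output container, not counted as working space
    for i in range(len(a)):
        v = abs(a[i])            # strip any mark already placed on this slot
        if a[v - 1] < 0:         # slot for v is marked: v was seen before
            duplicates.add(v)
        else:
            a[v - 1] = -a[v - 1]  # mark v as seen
    for i in range(len(a)):      # undo all marks
        a[i] = abs(a[i])
    return duplicates
```

On the example from the comment, `find_duplicates_by_sign_marking([3, 3, 1, 2, 4])` reports `3` and leaves the list as it was.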

1 Answer

Let's think first about algorithms that read the entire input array before producing any output. Note that there are exponentially many possible outputs. In particular, since each "multiple duplicate" occupies at least two positions, the output might include up to $n/2$ different numbers, and any set of $n/2$ values from $[1, n-1]$ can occur as the output. Thus, there are at least ${n-1 \choose n/2}$ different possible outputs, and ${n-1 \choose n/2}$ is $\Theta(2^n/\sqrt{n})$. Now the state of the algorithm (the values of all variables, etc.) before producing output must uniquely determine the output, so there must be at least ${n-1 \choose n/2}$ possible states. Storing that state requires at least $\lg {n-1 \choose n/2}$ bits. It follows that any algorithm that reads all input before producing any output must use at least $\lg {n-1 \choose n/2}$ bits of space, and this is $\Theta(n)$. So any such algorithm must use linear space.
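
As a quick numerical sanity check of the $\Theta(n)$ claim, the bound can be evaluated directly. The sketch below is only illustrative: the helper name `state_bits_lower_bound` is made up here, and $n/2$ is taken as integer division.

```python
import math

def state_bits_lower_bound(n):
    """Return lg C(n-1, n/2), the number of bits needed to tell that many outputs apart."""
    return math.log2(math.comb(n - 1, n // 2))

# The ratio bits / n approaches 1 as n grows, consistent with a linear bound.
for n in (16, 128, 1024):
    bits = state_bits_lower_bound(n)
    print(n, round(bits, 1), round(bits / n, 3))
```

The gap between the bound and $n$ grows only logarithmically, matching the $\sqrt{n}$ factor in $\Theta(2^n/\sqrt{n})$.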

What about algorithms that might produce output as they go? It's still not possible in constant space; such algorithms still need linear space. Consider an input where the first half of the array has no duplicates, and think about the intermediate state of the algorithm at that point. That state must contain enough information to deduce the set of numbers in the first half of the input array (if two such input arrays that differ in their first half produce the same state at this point, then we can construct a way to fill in the second half so that the algorithm is wrong for at least one of those -- just pick a number that appears in the first half of one of those inputs but not the other, and make sure it appears exactly once in the second half). It follows that there must be at least ${n-1 \choose n/2}$ possible states after reading the first half of the input array. As before, it then follows that the algorithm needs $\Theta(n)$ bits of space. So producing output as you go doesn't help -- it still can't be done in constant space.
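
The "fill in the second half" step can be made concrete with a small sketch. The names `duplicates` and `distinguishing_suffix` are hypothetical, and the code assumes the two first halves have equal length, contain no repeated values, differ as sets, and that $n \geq 3$.

```python
from collections import Counter

def duplicates(a):
    """Reference answer: the set of values occurring more than once."""
    return {v for v, count in Counter(a).items() if count > 1}

def distinguishing_suffix(first_a, first_b, n):
    """Build a second half on which the two inputs require different outputs."""
    # A value that appears in exactly one of the two first halves.
    x = min(set(first_a) ^ set(first_b))
    # Any in-range value other than x, so that x occurs exactly once in the suffix.
    filler = 1 if x != 1 else 2
    return [x] + [filler] * (n - len(first_a) - 1)

first_a, first_b = [1, 2, 3], [1, 2, 4]
suffix = distinguishing_suffix(first_a, first_b, 6)
print(duplicates(first_a + suffix), duplicates(first_b + suffix))
```

Here `x` is a duplicate in one completed input but not in the other, so an algorithm that reaches the same state after both first halves must answer incorrectly on at least one of them.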