A box contains $n$ balls coloured $1$ to $n$. At each step you pick two distinct balls from the box uniformly at random, in order, and paint the second ball with the colour of the first; then you put both balls back into the box. What is the expected number of steps until all balls in the box have the same colour?
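The process is easy to simulate directly. Here is a quick Monte Carlo sketch (the function names are mine, not from the thread) whose estimates agree with the $(n-1)^2$ formula derived in the answers below:

```python
import random

def steps_to_monochrome(n, rng):
    """One run: repeatedly pick an ordered pair of distinct balls uniformly
    at random, paint the second with the colour of the first, and count
    steps until all balls share one colour."""
    colors = list(range(n))
    steps = 0
    while len(set(colors)) > 1:
        i, j = rng.sample(range(n), 2)  # ordered pair of distinct balls
        colors[j] = colors[i]
        steps += 1
    return steps

def mean_steps(n, trials=20000, seed=0):
    rng = random.Random(seed)
    return sum(steps_to_monochrome(n, rng) for _ in range(trials)) / trials
```

For $n=3$ and $n=4$ the estimates hover near $4$ and $9$ respectively.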

I think this question is borderline, but I would like to keep it open. It's borderline because you already know the answer, and that makes this more of a puzzle than a research question. On the other hand, I find "is there an elegant proof of this pretty combinatorial formula" to be a reasonable type of question.
– David Speyer, Oct 12 '10 at 23:54


I first heard this problem at Mathcamp 2001. I believe the problem was invented there by Dave Savitt. I recall verifying it up to n=10 using linear relations among variables, one for each partition of n. Eventually, Dave and John Conway found a proof for all n. Their proof gave an explicit formula for the expected number of steps from any starting position, not just all distinct, and had a trivial inductive proof. IIRC, the formula involved harmonic numbers. While Ori Gurel-Gurevich's solution is very nice, I wonder if anyone can find the formula for all starting positions, which I have lost.
– aorq, Oct 13 '10 at 9:38


I certainly didn't invent the problem. IIRC I heard about it at a colloquium tea in the MIT common room (not 100% sure from whom, but maybe Greg Warrington?), put it aside at the time, and brought it out as a challenge problem at Mathcamp later that summer. The rest of Rex's story is accurate, I think.
– D. Savitt, Oct 14 '10 at 2:16

Of course, the mathematical solution is simple (though not elegant). I would like to see a more elegant solution.
– Pratik Poddar, Oct 14 '10 at 6:22

6 Answers

It can probably be done by looking at the sum of squares of sizes of color clusters and then constructing an appropriate martingale. But here's a somewhat elegant solution: reverse the time!

Let's formulate the question as follows. Let $F$ be the set of functions from $\{1,\ldots,n\}$ to $\{1,\ldots,n\}$ that are almost the identity, i.e., $f(i)=i$ for all $i$ except a single value $j$. Then if $(f_t)$ is a sequence of i.i.d. functions drawn uniformly from $F$, and
$$g_t=f_1 \circ f_2 \circ \ldots \circ f_t$$
then you can define $\tau= \min \{ t \mid g_t \text{ is constant}\}$. The question is then to calculate $\mathbb{E}(\tau)$.

Now, one can also define the sequence
$$h_t=f_t \circ f_{t-1} \circ \ldots \circ f_1$$
That is, the difference is that while $g_{t+1}=g_t \circ f_{t+1}$, here we have $h_{t+1}=f_{t+1} \circ h_t$. This is the time reversal of the original process.

Obviously, $h_t$ and $g_t$ have the same distribution so
$$\mathbb{P}(h_t \text{ is constant})=\mathbb{P}(g_t \text{ is constant})$$
and so if we define $\sigma=\min \{ t \mid h_t \text{ is constant}\}$ then $\sigma$ and $\tau$ have the same distribution and in particular the same expectation.

Now calculating the expectation of $\sigma$ is straightforward: if the range of $h_t$ has $k$ distinct values, then with probability $k(k-1)/n(n-1)$ this number decreases by 1 and otherwise it stays the same. Hence $\sigma$ is the sum of geometric distributions with parameters $k(k-1)/n(n-1)$ and its expectation is
$$\mathbb{E}(\sigma)=\sum_{k=2}^n \frac{n(n-1)}{k(k-1)}= n(n-1)\sum_{k=2}^n \left(\frac1{k-1} - \frac1{k}\right) = n(n-1)\left(1-\frac1{n}\right) = (n-1)^2 .$$
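The telescoping sum can be checked exactly with rational arithmetic (a sketch; the function name is mine):

```python
from fractions import Fraction

def expected_sigma(n):
    """Sum of the geometric-step means n(n-1)/(k(k-1)) for k = 2..n,
    computed exactly as a Fraction."""
    return sum(Fraction(n * (n - 1), k * (k - 1)) for k in range(2, n + 1))
```

For every $n$ checked, `expected_sigma(n)` equals $(n-1)^2$ exactly.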

I don't follow this point. Why do $\sigma$ and $\tau$ have the same distribution? Equality of the distributions at each fixed time (not of the joint distributions of the processes!) does not by itself imply equality of the distributions of the first stopping time. I do not think so. But hopefully we may fix it: the expectation of $\sigma$ equals $\sum_{n=1}^{\infty} \mathbb{P}(\sigma\geq n)=\sum_n \mathbb{P}(h_{n-1}\neq \text{const})$, and in this last expression we may replace $h$ by $g$.
– Fedor Petrov, Oct 13 '10 at 8:33


Aha! I now understand Ori's answer. At time $t$, considering all steps from step $t$ to the end, there will be $k$ balls whose colors are mapped to all the other balls at the end. Considering time step $t-1$, the only way to reduce $k$ is to choose two of these $k$ influential balls, and have the color of one mapped to that of another. This gives the recursion in his answer. Very nice, although it could be explained better.
– Peter Shor, Oct 13 '10 at 13:13

Sorry, I am a bit confused: 1. What is the domain and codomain of each $f$? Are we talking about a map between balls and colors, or just a transition between colors? Without this being clear I don't see why introducing the time-reversed process is really magic. 2. In Peter's comment above, what are the "influential balls"? At time $t$, if we define the influential balls to be the balls whose current color eventually wins in the end, then at time $t+1$ this number could increase, decrease or stay the same. Why do we have a geometric distribution?
– Ying Zhang, Jun 27 '14 at 16:45

3. Why is $f$ chosen uniformly from $F$ at each step? Again this has to do with the definition of $f$. But this probably doesn't matter as long as the other steps make sense.
– Ying Zhang, Jun 27 '14 at 16:47

Consider just those sequences of selections that result in the final colour being $c$. If at some point during a sequence we have $k$ of the balls being this colour, we can define $E_k$ as the expected number of selections from here before all the balls are coloured $c$.

Doing this, we need to take account of the fact that not all selections are equally probable: each selection must be multiplied by the probability that it results in $c$ being the eventual colour. Happily, this probability is simply $k'/n$, where $k'$ is the number of balls coloured $c$ after this selection.
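This reweighting turns the count of $c$-coloured balls into a birth-death chain. My derivation (an assumption spelling out the answer's idea, not the answerer's own): unconditioned, $k \to k\pm 1$ each with probability $k(n-k)/(n(n-1))$, and conditioning on $c$ winning (probability $k'/n$, versus $k/n$ beforehand) multiplies each transition by $k'/k$, giving $(n-k)(k+1)/(n(n-1))$ up and $(n-k)(k-1)/(n(n-1))$ down. The system for $E_k$ then solves by telescoping (function and variable names are mine):

```python
from fractions import Fraction

def e1_conditioned(n):
    """E_1: expected steps to absorption starting from one ball of the
    eventually-winning colour c, in the chain conditioned on c winning.
    With D_k = E_k - E_{k+1}: q_up(k) * D_k = 1 + q_down(k) * D_{k-1},
    and q_down(1) = 0, so the recursion starts cleanly at k = 1."""
    denom = n * (n - 1)
    D = Fraction(0)
    total = Fraction(0)  # E_1 = D_1 + ... + D_{n-1}, since E_n = 0
    for k in range(1, n):
        q_up = Fraction((n - k) * (k + 1), denom)
        q_down = Fraction((n - k) * (k - 1), denom)
        D = (1 + q_down * D) / q_up
        total += D
    return total
```

This reproduces $(n-1)^2$ exactly, consistent with the other answers.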

Fix $n$, so that I don't have to include it in my notation. In all other respects, copy Aaron's notation. JBL observes that there appear to be numbers $f(1)$, $f(2)$, ..., $f(n-1)$ such that
$$ p_{\lambda} = \sum_{k \in \lambda} f(k). \quad \quad (*)$$

We will show that such $f$'s exist, and give formulas for them. In particular, it will be clear that $f(1) = (n-1)^2/n$, proving the result. For convenience, we set $f(0)=f(n)=0$.

Our proof breaks into two parts: showing that there is a unique solution to $(**)$ and showing that the resulting $f$'s obey $(*)$. We do the second part first.

We must establish the Markov relation:
$$\sum_{k \in \lambda} f(k) = 1 + \sum_{\mu} p(\lambda \to \mu) \sum_{k \in \mu} f(k).$$
For any $k$ in $\lambda$, the modified partition $\mu$ contains either $k-1$, $k+1$ or $k$, depending on whether we lost a ball of the corresponding color, gained one, or kept the same number.
The probabilities of these events are $k(n-k)/n(n-1)$, $k(n-k)/n(n-1)$, and $1-2 k(n-k)/n(n-1)$, respectively.

For $n=3$, one turn takes you to 2 balls of one color and 1 of another. To get to a single color from there, you must pick one of the 2 as the first ball (prob $=2/3$) and then the odd one as the second (prob $=1/2$), so each subsequent turn succeeds with probability $1/3$ and the expected number of further turns is 3. The expected total number of turns is $1+3=4$.

edit: I was indeed mistaken. The suggested answer is correct as far as I checked ($n=10$). It is easy enough to set up a system of equations for the expectations and compute. Here are the results up to $n=6$; the notation should be clear enough:
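Such a system can be set up mechanically: states are partitions of $n$ recording colour-class sizes, and choosing a first ball from a class of size $a$ and a second from a different class of size $b$ moves one ball from the $b$-class to the $a$-class with probability $ab/(n(n-1))$. A sketch of an exact solver (all names are mine) using rational Gaussian elimination on $(I-P)E = \mathbf{1}$:

```python
from fractions import Fraction

def partitions(n, max_part=None):
    """All partitions of n as nonincreasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for p in range(min(n, max_part), 0, -1):
        for rest in partitions(n - p, p):
            yield (p,) + rest

def step_probs(lam, n):
    """Transition probabilities out of partition lam for one recolouring."""
    out, parts, denom = {}, list(lam), n * (n - 1)
    for src in range(len(parts)):
        for dst in range(len(parts)):
            if src == dst:
                # both balls in the same class: the partition is unchanged
                p, mu = Fraction(parts[src] * (parts[src] - 1), denom), lam
            else:
                p = Fraction(parts[src] * parts[dst], denom)
                new = parts[:]
                new[src] += 1
                new[dst] -= 1
                mu = tuple(sorted((x for x in new if x), reverse=True))
            out[mu] = out.get(mu, Fraction(0)) + p
    return out

def expected_steps(n):
    """Exact expected recolourings from all-distinct to monochrome."""
    states = [lam for lam in partitions(n) if lam != (n,)]
    idx = {lam: i for i, lam in enumerate(states)}
    m = len(states)
    # Augmented matrix for (I - P) E = 1, over Fractions.
    A = [[Fraction(0)] * m + [Fraction(1)] for _ in range(m)]
    for i, lam in enumerate(states):
        A[i][i] += 1
        for mu, p in step_probs(lam, n).items():
            if mu in idx:
                A[i][idx[mu]] -= p
    for col in range(m):  # Gauss-Jordan elimination with row pivoting
        piv = next(r for r in range(col, m) if A[r][col])
        A[col], A[piv] = A[piv], A[col]
        inv = A[col][col]
        A[col] = [x / inv for x in A[col]]
        for r in range(m):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return A[idx[(1,) * n]][m]
```

For small $n$ this matches $(n-1)^2$, and it also yields the expectation from any starting partition, the quantity aorq's comment asks about.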

I agree with Ross Millikan's comment below. I have verified the claimed formula up to $n=4$. The only approach I can imagine for this problem is to draw out the Markov chain explicitly and find the expected time it takes for the chain to hit the one non-transient state.
– Hedonist, Oct 12 '10 at 23:48

I also confirm 4 for $n=3$. It's $1+1+2/3+(2/3)^2+(2/3)^3+\cdots$. I think you might have taken the ratio in the geometric series to be $1/3$ by mistake.
– David Speyer, Oct 12 '10 at 23:54


It seems very likely that there exist functions $f(n, k)$ such that $p_{\lambda} = \sum_{i} f(|\lambda|, \lambda_i)$. For example, it appears that $f(n, 1) = \frac{(n - 1)^2}{n}$ and $f(n, 2) = \frac{(2n - 1)(n - 2)}{n}$. Probably a little more data would be enough to guess the general form and proceed as in A. Rex's comment on the question.
– JBL, Oct 13 '10 at 13:20