Personal thoughts on maths and programming, often illustrated with code snippets

Menu

The cereal box problem: How many boxes does it take to find all prizes?

Children’s cereal manufacturers often attract the attention of young clients by including small prizes and toys in every box; sometimes all prizes are identical, but most often individual prizes are part of a collection, and kids are encouraged to collect them and try to complete a full collection. How long does it take ?

Simple probabilistic modeling shows that on average \(n (1 + 1/2 + \ldots + 1/n)\) boxes are required to complete a full set of \(n\) prizes: for example, it takes on average \(14.7\) boxes to complete a full set of six prizes.

Numerical and experimental simulation

The expected number of boxes to buy before finding all \(n\) prices is shown below:

This histogram shows the calculated average number of purchases required to accumulate at least one copy of each prize, for various sizes \(n\) of prizes sets: for example, finding all prizes in a series of twelve requires on average 37.24 purchases. Note that the curve is somewhat steeper than linear, due to a multiplicative logarithmic factor \(\left(1 + \frac12 + \frac13 + \ldots + \frac1{n}\right) \sim \log(n)\)

And here is an experimental simulation for \(n = 6\): the graph below shows how many purchases were required to obtain six prizes out of six, over 1,000,000 simulations.

This histogram shows the experimental probability distribution of \(T_6\); that is, of the number of cereal boxes that one needs to purchase before finding all prizes in a series of six. Interestingly, despite the peak at \(t = 11\), the distribution’s long tail offsets the average to \(14.65\) in this experiment — close to the theoretical prediction of \(14.7\).

Calculations and simulation code

Theoretical average

Assume there are \(n\) different prizes, all of which are equally distributed in cereal boxes. Model buyer behavior as a sequence of purchases, and call \(T_k\) the time at which they find the \(k\)th prize : clearly \(T_0 = 0\) (no purchases are needed to find 0 prizes) and \(T_1 = 1\) (one finds their first prize immediately after buying their first cereal box). Naturally, \(T_n\) denotes the time required to find all prizes.

Note that, as expected, \(\E{\Delta_k}\) quickly increases as \(k\) grows larger: the more prizes have already been found, the harder it is to get a new one, to the point that finding the last prize requires \(n\) purchases on average.

Numerical simulations

The experimental data above was simulated using the following code:

repeat = 1000*1000
time_to = defaultdict(Counter)
for _ in range(repeat):
purchased = 0
collected_prizes = set()
while len(collected_prizes) < n: # Until all prizes have been found
purchased += 1 # Purchase one more box
prize = random.randint(1, n) # Open it
if prize not in collected_prizes: # Is it a new prize ?
time_to[len(collected_prizes) + 1][purchased] += 1 # Yes, count it !
collected_prizes.add(prize) # And add it to list of already found prizes

Conclusion

One might want to find some intuitive way to comprehend the formula derived above, \(\E{T_n} = n\left(1 + 1/2 + \ldots + 1/n\right)\). A relatively easy way to get such an intuition is to summarize the calculation above, realizing that each new packet bought after finding the \(k\)th prize brings an extra prize with probability \((n-k)/n\) ; since each purchase is independent, it takes on average \(n / (n-k)\) purchases to find a new prize; then summing over \(k\) yields the final formula.

Have you ever completed a full set of cereal box prizes? How long did it take?