On Fri, 28 Oct 2005, Johan Geldenhuys wrote:
> After I tested the previous code, I noticed that the odds is 1:49 that a
> duplicate number can be found in the 6 digit range (and it happended)
> and that 0 can also be found.
Hi Johan,
I'm not exactly sure how 1 in 49 are the odds of getting a duplication.
Let's talk about this for a moment.
[Warning: long post ahead.]
Just to make sure: the situation is that we're rolling a fifty-sided dice
six times, and we'd like to see what's the probability of having
duplicates in the mix. We have 50**6 possible rolls, but there are only:
50 * 49 * 48 * 47 * 46 * 45
ways of making sure we don't hit duplicates. The probability that we
don't hit any duplicates, then, is:
######
>>> (50 * 49 * 48 * 47 * 46 * 45) / (50 **6)
0L
#####
Oops, forgot about integer division. *grin*
Let me try that again:
######
>>> (50.0 * 49 * 48 * 47 * 46 * 45) / (50 **6)
0.73224345599999996
######
Ok, so we have about a 73% chance of not having any duplicate numbers if
we roll a 50-digit dice six times.
Since we can either have no duplication or duplication, then he
probability that we will have duplication is:
(1 - probability of not having duplication)
######
>>> 1 - (50.0 * 49 * 48 * 47 * 46 * 45) / (50 **6)
0.26775654400000004
######
So we have about a 27% percent chance of getting duplication when we roll
a fifty sided dice six times.
But that's just probability theory. Is this real? Is this truly
probable?
Let's test this experimentally. First, we need some way of generating a
roll of dice. Here's a quick-and-dirty function to do that:
######
>>> def make_roll(n, r):
... return [random.randrange(n) for i in range(r)]
...
>>> make_roll(50, 6)
[11, 48, 39, 43, 30, 24]
>>> make_roll(50, 6)
[6, 47, 0, 34, 9, 38]
######
We can test things experimentally by writing a function to see a roll has
duplication:
######
>>> def has_duplication(roll):
... d = {}
... for number in roll:
... d[number] = d.get(number, 0) + 1
... for count in d.values():
... if count > 1:
... return True
... return False
...
>>> has_duplication([1, 2, 3])
False
>>> has_duplication([1, 2, 2])
True
######
Once we have this tool, then it becomes easy to do a trial run and see, if
we roll 10 times, what percentage of those are duplicates.
######
>>> count_duplicates = 0
>>> for i in range(10):
... if has_duplication(make_roll(50, 6)):
... count_duplicates = count_duplicates + 1
...
>>> count_duplicates
1
>>> 1.0 / 10
0.10000000000000001
######
Experimentally, we're seeing 1%, but that might just be a fluke. Or maybe
it's just because the trial size is much too small. *grin*
Let's formalize this test as a function, just to make it easier to retry:
######
>>> def do_trial(n):
... """Returns the number of duplicates if we do a n-trial."""
... count_duplicates = 0
... for i in range(n):
... if has_duplication(make_roll(50, 6)):
... count_duplicates = count_duplicates + 1
... return float(count_duplicates) / n
...
>>>>>> do_trial(5)
0.40000000000000002
>>> do_trial(10)
0.29999999999999999
>>> do_trial(100)
0.27000000000000002
>>> do_trial(1000)
0.26900000000000002
>>> do_trial(10000)
0.2621
>>> do_trial(100000)
0.26676
######
So we're experimentally getting evidence that the probability of rolling
duplicates is about 26%, if we have a 50-sided dice six times. Hey,
that's not bad: that's about the value we got from doing math.
As we scale our expeiment higher, we start seeing the Law of Large Numbers
taking place: when we use lots of trials, then our experimental results
start getting very close to the one we calculated with probability theory.
http://en.wikipedia.org/wiki/Law_of_large_numbers
Here's a punchline: the problem we've been looking at is another disguise
for the Birthday Paradox:
http://en.wikipedia.org/wiki/Birthday_problem
That article makes the assertion that if we have 23 people in a room, the
probability that we have at least two people with the same birthday is
pretty good (about 50%). We can modify do_trial() to experimentally see
that!
######
>>> def do_trial(n, x, y):
... """Returns the number of duplicates if we do a n-trial."""
... count_duplicates = 0
... for i in range(n):
... if has_duplication(make_roll(x, y)):
... count_duplicates = count_duplicates + 1
... return float(count_duplicates) / n
...
>>> do_trial(10000, 365, 23)
0.51390000000000002
>>> do_trial(10000, 365, 1)
0.0
>>> do_trial(10000, 365, 2)
0.0023
>>> for i in range(23):
... print i, "people, probability of shared birthday:",
do_trial(10000, 365, i)
...
0 people, probability of shared birthday: 0.0
1 people, probability of shared birthday: 0.0
2 people, probability of shared birthday: 0.0035
3 people, probability of shared birthday: 0.0087
4 people, probability of shared birthday: 0.0181
5 people, probability of shared birthday: 0.0273
6 people, probability of shared birthday: 0.0436
7 people, probability of shared birthday: 0.0551
8 people, probability of shared birthday: 0.0736
9 people, probability of shared birthday: 0.0932
10 people, probability of shared birthday: 0.1201
11 people, probability of shared birthday: 0.1325
12 people, probability of shared birthday: 0.1654
13 people, probability of shared birthday: 0.1967
14 people, probability of shared birthday: 0.2241
15 people, probability of shared birthday: 0.2581
16 people, probability of shared birthday: 0.281
17 people, probability of shared birthday: 0.3169
18 people, probability of shared birthday: 0.3497
19 people, probability of shared birthday: 0.3843
20 people, probability of shared birthday: 0.4171
21 people, probability of shared birthday: 0.438
22 people, probability of shared birthday: 0.4817
######
The variable names 'x' and 'y' suck (I'll try to think of better ones next
time), but I hope that it's clear what we're doing: we're making
do_trial() more general so it can handle different dice rolls. And
because they are parameters in our do_trial() function, we can then see
what the situation loosk like as we get more and more kids in the
classroom.
Does this make sense? If you have any questions about this, please feel
free to ask. Hope this helps!