
August 12, 2010

Ts and Hs

Imagine I'm a professor teaching the basics of probability. One day, I give the class this assignment: sit down at home, toss a coin a hundred times and write down each result, Heads or Tails. Bring me that sequence of Hs and Ts tomorrow.

Now I suspect most of the class is too lazy to toss a coin a hundred times. It's likely a lot of them will simply write out what looks like a random sequence of Hs and Ts and turn that in.

But I am almost sure that, with a quick look at each submitted sequence, I will be able to tell which ones were produced in this fake way, and which ones are actually a faithful record of a hundred coin tosses.

Kovendhan has a comment with the right answer: "In order to appear genuine, students will avoid long sequences (6 or more) of consecutive heads or tails. In reality, such long sequences almost always occur in long trials."

Which of course raises the bonus question: in a sequence of 100 coin tosses, what's the probability that there will be at least one sequence of 6 heads or tails?

All right, bonus question #2: what if I changed the "6" immediately above to "6 or more"?

At any step, the probability that there will be 6 consecutive heads or tails is (0.5)^6 = 0.016. The probability of the above event not happening at a particular step is 1 - 0.016 = 0.984. The probability of the event not happening in any of the first 94 steps is (0.984)^94 = 0.219. The answer is therefore 1 - 0.219 = 0.781.
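For anyone who wants to check the arithmetic, here is a small Python sketch of the same estimate. It keeps the comment's own assumptions (94 steps, each treated as independent, and (0.5)^6 as the chance of a hit at each step); the function name and the `round_to` parameter are made up for illustration. Without the intermediate rounding to 0.016, the result comes out a little lower than 0.781.

```python
# Back-of-the-envelope estimate, treating every step as independent
# (an assumption of the estimate, not a fact about coin tosses).
def approx_prob(run_len=6, steps=94, round_to=None):
    p_hit = 0.5 ** run_len                # six given tosses all show one given face
    if round_to is not None:
        p_hit = round(p_hit, round_to)    # mimic the 0.016 rounding in the estimate
    return 1 - (1 - p_hit) ** steps

rounded = approx_prob(round_to=3)         # uses 0.016
unrounded = approx_prob()                 # uses 1/64 = 0.015625
print(f"with rounding: {rounded:.3f}, without: {unrounded:.3f}")
```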

The answer is the same for "6 or more". For more than 6 heads (or tails), you need to get 6 in the first place.

Minor quibble: Kovendhan's answer is good, but it is an approximation, since the different sets of 6 successive tosses in 100 are not independent. E.g., "HTTTTT" is a bad sequence, but there is a 50% chance that the next 6-mer will be "TTTTTT". So it somewhat underestimates the probability of getting 6 or more heads. My numerical experiments suggest the true probability is above 0.8 (sorry, offhand I don't see an easy way to calculate it exactly).
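A numerical experiment of this kind might look like the sketch below. It is not the code behind the figure quoted above; the function names, the trial count and the seed are all assumptions chosen for illustration. It counts a sequence as a hit when its longest run of identical faces (heads or tails) is at least 6.

```python
import random

def longest_run(seq):
    """Length of the longest block of identical consecutive symbols."""
    if not seq:
        return 0
    best = current = 1
    for prev, cur in zip(seq, seq[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best

def estimate(n_tosses=100, run_len=6, trials=10000, seed=0):
    """Monte Carlo estimate of P(some run of >= run_len heads or tails)."""
    rng = random.Random(seed)             # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        seq = [rng.choice("HT") for _ in range(n_tosses)]
        if longest_run(seq) >= run_len:
            hits += 1
    return hits / trials

print(estimate())
```

With 10,000 trials the sampling error is a few tenths of a percent, so the estimate is only good to about two decimal places.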

He is also right that "six or more" is the same as six, unless you meant exactly 6, which was not obvious from your statement.

Major quibble: This leaves a nearly 20% chance that no such sequence will occur, which is hardly a reliable way to distinguish a fake answer. (If the entire class were honest, you'd unfairly mark one in five as a cheat.) If you look for 5 or more heads/tails, there is a less than 5% chance of that not occurring, which is still not very convincing. Plus, students could well put in five or more by chance.

In fact you can very reliably determine human-generated sequences from truly random sequences, but you have to be more sophisticated than this. A combination of several statistical properties of random sequences can do it (even if each property has a 20% chance of occurring in an honest sequence, if you test 10 such properties the chances of *all* of them being satisfied in a random sequence are vanishingly small.)

Lastly -- if I were a lazy student, I'd use a pseudorandom number generator to get the sequence...

Let me address Rahul's quibbles. There is more to the minor quibble. It could well be overestimating too, for a sequence like HTHTTH cannot generate 6 heads or tails with the next toss.

The major quibble isn't that big an issue I think. I think that a student would try to "enforce" randomness, and I am not sure if he would put in a sequence of heads/tails of length more than 3. A visual inspection would be more than sufficient. One could also count the number of 4-sequences and 5-sequences. I feel that these might not appear at all/appear very infrequently in the concocted random sequence.
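The counting suggested here is easy to automate. Below is a rough Python sketch; the function names and the sample sequence are hypothetical, and the thresholds 4 and 5 are simply the ones mentioned above. It splits a sequence into runs and reports the longest run and how many runs reach length 4 or 5.

```python
from itertools import groupby

def run_lengths(seq):
    """Lengths of the maximal blocks of identical symbols, in order."""
    return [len(list(group)) for _, group in groupby(seq)]

def run_summary(seq):
    runs = run_lengths(seq)
    return {
        "longest": max(runs, default=0),
        "runs_of_4_plus": sum(r >= 4 for r in runs),
        "runs_of_5_plus": sum(r >= 5 for r in runs),
        "n_runs": len(runs),
    }

sample = "HTHHTHTTHTHHTHTTHHTH"   # hypothetical hand-written sequence
print(run_summary(sample))
```

A summary like this only raises suspicion; an honest sequence can also lack long runs.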

Let me give the promised story. It's about a famous Computer Scientist/Mathematician who has a penchant for gadgets. He once constructed this gadget which, on the press of a button, would generate a random 2-digit number. He was using this to generate a long random sequence of 2-digit numbers for an experiment. He was copying down the numbers after each button press.

After he had reached sufficient length, he looked at his piece of paper and thought a while. He realized his folly and threw the paper into the trash. Basically, his list of numbers had no consecutive repeats, whereas if the numbers were truly random it should have had many such repeats. The issue was not with the gadget. While taking down the numbers, when he pressed the button, if he saw the same number, he thought that he had forgotten to press, or not pressed the button hard enough for it to register, and pressed it again to get a fresh number. So every natural repetition of the numbers didn't show up in his final list. It's interesting to note how our perceived notion of randomness is different from the actual notion, and can lead us to interesting actions.

I think a much simpler intuition for the non-mathematically inclined reader is that it's very hard to manufacture a fair coin. Diaconis places special orders to get his fair coins manufactured in the lab. Ergo, every coin out there is unfair. Ergo, you'll get long series of H's & T's if you use a real coin. Most students don't know of these things & will simply intersperse H's & T's.

While I admit the veracity of the quibbles (I'm ashamed of that sentence!), the point is not to definitively identify what's a genuine sequence and what's not, but what subruk gets at above. Somebody concocting what she thinks is a random sequence of 100 H and T will almost certainly not put down 6 H or 6 T in a row (and for that matter, probably not 5 or 4 in a row either). Because we think to ourselves that that is "not random".

It would be interesting to see someone knowledgeable about probability concoct such a sequence, and then subject that one to tests of "randomness". What signs would such a person leave, if any, of it being concocted?

Sukrut is right, it is not all that obvious whether 0.781 is an overestimate or underestimate (but the numerics say it is an underestimate).

I think I can see how to calculate it exactly but it requires careful counting: will update if I get around to doing it later.

I think a generic human-generated sequence can be detected by "eyeballing" (if the one with the eyeballs has some expertise). Whether a human can train himself to generate random sequences that fool an expert at a glance, I don't know. I'm pretty sure that nobody can generate large sets of random numbers that would fool the statistical tests...

OK, my exact answer for "the probability of 6 or more successive heads (or tails) in 100 tosses" is 1 - (1481219/1614480)^19 = 0.8054, approximately. This is in very good agreement with what I got numerically and not very different from Kovendhan's approximate answer.
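One way to cross-check a figure like this without any closed formula is plain bookkeeping: step through the tosses and track the probability of each possible length of the current run, dropping a sequence as soon as its run reaches 6. The sketch below does that with exact fractions for a fair coin. It is not the derivation behind the expression above; the function name is made up, and it assumes `run_len >= 2` and `n_tosses >= 1`.

```python
from fractions import Fraction

def prob_run_either(n_tosses=100, run_len=6):
    """Exact P(some run of >= run_len identical faces), fair coin.

    Assumes run_len >= 2 and n_tosses >= 1.
    """
    half = Fraction(1, 2)
    # no_run[j]: probability that no qualifying run has occurred yet and the
    # current trailing run has length j (1 <= j < run_len). Index 0 is unused.
    no_run = [Fraction(0)] * run_len
    no_run[1] = Fraction(1)                 # after one toss the run length is 1
    for _ in range(n_tosses - 1):
        nxt = [Fraction(0)] * run_len
        nxt[1] = half * sum(no_run[1:])     # face changes: run restarts at 1
        for j in range(1, run_len - 1):
            nxt[j + 1] = half * no_run[j]   # face repeats: run grows by 1
        no_run = nxt                        # runs that reach run_len are dropped
    return 1 - sum(no_run)

print(float(prob_run_either()))
```

The result is an exact rational number, so it can be compared digit by digit with a simulation or with any proposed formula.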

But the explanation probably requires a blog post in itself, which I may write in a day or two. Judging by that formula, there is no simple way to obtain it -- unless (a) I missed something and my answer is therefore an approximation too (which is entirely possible); and (b) the real answer is much easier to obtain (which I doubt).

Anon: why would that answer be right? It looks like the probability that the first six are heads (or tails). Or a specific six. And even then, it is six or more - not exactly six.

And no, multiplying by 95 (the number of ways you can pick 6 successive tosses) doesn't work and in fact gives you a number greater than 1. Figuring out why is an exercise for the reader.

The probability that there is exactly one instance of exactly six successive heads (or tails) somewhere in the sequence isn't any easier than the 'six or greater' problem: one can do it but I don't think it will be very illuminating.

as - thanks, very interesting. Perhaps Feller's textbook derives the formula (but I wonder if it will be an example of Jaynes' complaint about Feller's approach: that Feller, being a clever man, derives most of his results using some beautiful trick or other, and therefore the student concludes that probability theory can only be done by clever people using clever tricks and not by following some basic principles systematically).

Note that this still doesn't answer the question of 6 or more successive heads OR tails. The answer to this is P(6+ heads) + P(6+ tails) - P(6+ heads AND 6+ tails). The formula you gave can be used for the first two terms, but not the third.
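To see the three terms side by side, one option is to enumerate states rather than use a formula. The sketch below (hypothetical names, fair coin, `run_len >= 2` assumed) tracks the last face, the current run length, and two flags saying whether a heads run and a tails run of the required length have already appeared. It then reads off each term of the inclusion-exclusion expression.

```python
from collections import defaultdict
from fractions import Fraction

def run_probabilities(n_tosses=100, run_len=6):
    """Exact P(heads run), P(tails run), P(both), P(either); assumes run_len >= 2."""
    half = Fraction(1, 2)
    # State: (last face, trailing run length capped at run_len,
    #         heads run seen?, tails run seen?) -> probability
    states = {("H", 1, False, False): half, ("T", 1, False, False): half}
    for _ in range(n_tosses - 1):
        nxt = defaultdict(Fraction)
        for (last, length, seen_h, seen_t), p in states.items():
            for face in "HT":
                new_len = min(length + 1, run_len) if face == last else 1
                hit = new_len >= run_len
                key = (face, new_len,
                       seen_h or (hit and face == "H"),
                       seen_t or (hit and face == "T"))
                nxt[key] += half * p
        states = nxt
    total = lambda pred: sum(p for (_, _, h, t), p in states.items() if pred(h, t))
    return {
        "heads": total(lambda h, t: h),
        "tails": total(lambda h, t: t),
        "both": total(lambda h, t: h and t),
        "either": total(lambda h, t: h or t),
    }

r = run_probabilities()
print({name: float(value) for name, value in r.items()})
# Inclusion-exclusion holds exactly with Fractions:
print(r["heads"] + r["tails"] - r["both"] == r["either"])
```

By symmetry the heads and tails terms are equal for a fair coin, so only the third term needs the extra flag.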

Not being a mathematician/statistician, I'd thought this would be very difficult. And no wonder, it is! But coincidentally, was recently taught about binomial & Poisson distributions, and I guess one has to apply the equation for the latter.

But one thing I can point out is, you can only suspect if a given sequence is fake, but never be certain! ;)