Tuesday, April 14, 2009

Dear Dr. Math,

I've heard that if I wanted to, ahem, "creatively adjust" some numbers, I should use numbers that start with the digit 1 more often. Why is that?

Inquiring Re. Statistics

Dear IRS,

How timely of you to bring this up! Indeed, there is a general pattern in the digits typically found in measured quantities, especially those spanning many orders of magnitude, for example: populations of cities, distances between stars, or, say, ADJUSTED GROSS INCOME. The pattern is that the digit 1 occurs most often as the leading digit of the number, approximately 30% of the time, followed by the digit 2 about 18% of the time, and so on. The probability, in fact, of having a leading digit equal to d is equal to log10(1+1/d), the base-10 logarithm of 1+1/d, for any d = 1,2,...,9. This rule is called Benford's Law, named (as is often the case) for the second person to discover it, when he noticed that the pages of the library's book of logarithms were much dirtier, hence more used, at the front of the book where the numbers began with 1. In pictures, the distribution of digits looks like this:
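In numbers, the formula can be tabulated directly. What follows is a minimal Python sketch, not anything from the original column; the function name benford_probability is made up for illustration.

```python
import math


def benford_probability(d: int) -> float:
    """Probability that the leading digit equals d under Benford's Law."""
    return math.log10(1 + 1 / d)


# Print the predicted share for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

The nine probabilities add up to 1, because the sum of the logarithms collapses to log10(10).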

It seems counterintuitive that any digit should be more likely than any other. After all, if we pick a number "at random," shouldn't it have the same probability of being between 100 and 199 as it does of being between 200 and 299, etc.? If so, the probability of getting a 1 as the first digit would in fact be the same as getting a 2. However, this turns out to be impossible, and it has to do with a very common misconception about "randomness."

The fact of the matter is that there's actually no way to pick a number uniformly at random without further restrictions. So, for example, if I tell you to "pick a random number," it must be the case that you're more likely to select some particular number than some other (which ones, however, is up to you). Assume this weren't true, so all numbers are equally likely. Just to be clear, let's focus on the positive integers, the numbers 1,2,3,... Now let p be the probability of picking any one of them, say the number 1. Since they're all supposedly equally likely, this means p is also the probability of picking 2, and of picking 3, and so on. So the chance of picking any number between 1 and 10, say, is 10*p. Since probabilities are never more than 1, this means p <= 1/10. OK, well, by the same reasoning, the probability of picking a number between 1 and 1000 is 1000*p, so p <= 1/1000. Similarly, p <= 1/1,000,000, p <= 1/(1 googol), and so on. In fact, it follows that p <= 1/N for any N, and the only (non-negative) number that has that property is p = 0. Ergo, the chance of getting any particular integer is 0, from which it follows (for reasons I won't get into here) that the probability of picking an integer at all is 0, a "contradiction." That's math-speak for "whoops." You can only pick an integer uniformly from a finite set of possibilities.

So, what do we mean when we say that a number is "random"? Well, there are ways for things to be random without being uniformly random. For example, if you roll a pair of dice, you might say the outcome is "random," but you know that the sum is more likely to be 7 than it is to be 2. Similarly, if you pick a person (uniformly) randomly from the population of the U.S. (note: the population is finite, so that's OK), you might model his/her IQ as a random quantity with a normal distribution, a.k.a. a "bell curve," centered around 100. The existence of different distributions besides the uniform distribution is the source of a lot of popular misunderstandings about statistics.
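The dice example can be checked by brute force. This is a small Python sketch that assumes two fair six-sided dice; the function name dice_sum_distribution is invented for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def dice_sum_distribution() -> dict:
    """Exact probability of each possible sum of two fair six-sided dice."""
    # Enumerate all 36 equally likely (first die, second die) outcomes.
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


for total, p in dice_sum_distribution().items():
    print(total, p)
```

Each individual roll is uniform, but the sum is not: a 7 can happen six ways, a 2 only one way.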

None of that explains where Benford's Law comes from, of course, but it's at least an argument why it's plausible that the distribution isn't uniform. To explain the appearance of the particular logarithmic distribution of digits I wrote above, we'd need some kind of model for the quantities we were observing, and it can't just be "the uniform distribution on the positive integers," because we already showed that there's no such thing.

One reasonable idea is that the thing we're measuring might be "scale invariant." That is, if it has a wide range of possible values, it might not matter what size units we use to measure it--we'll get roughly the same distribution of numbers. So if we imagine switching from measuring lengths in feet to measuring them in "half-feet,"* say, then anything that gave us a foot-length starting with 1, say 1.2 feet or 1.8 feet, will now give us a half-foot length starting either with 2 or 3, in this case 2.4 and 3.6 "half-feet." If the two distributions are the same, then the occurrence of a first-digit 1 must be the same as the occurrence of a first-digit 2 or 3, combined. By the same reasoning, any quantity initially beginning with a 5, 6, 7, 8, or 9 would now begin with a 1, when doubled. Similarly, by tripling the scale, measuring in "third-feet" and assuming the same invariance, we'd get a 1 as often as a 3, 4, or 5 put together. And so on. By considering every possible scale, this line of reasoning leads you pretty much straight to Benford's Law. This scale invariance kind of makes sense if we're measuring ADJUSTED GROSS INCOME, since incomes vary by so much (so very, very much), whereas something like height wouldn't exhibit scale invariance, being more tightly distributed around its mean.
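The matching conditions described above can be checked against the logarithmic formula. This sketch only verifies that Benford's Law satisfies them; it does not derive the law from scale invariance. The helper names are made up for illustration.

```python
import math


def benford_probability(d: int) -> float:
    """Probability that the leading digit equals d under Benford's Law."""
    return math.log10(1 + 1 / d)


def combined(digits) -> float:
    """Total Benford probability of a group of leading digits."""
    return sum(benford_probability(d) for d in digits)


p1 = benford_probability(1)
# Doubling sends a leading 1 to a leading 2 or 3.
print(p1, combined([2, 3]))
# Doubling sends a leading 5, 6, 7, 8, or 9 to a leading 1.
print(p1, combined([5, 6, 7, 8, 9]))
# Tripling sends a leading 1 to a leading 3, 4, or 5.
print(p1, combined([3, 4, 5]))
```

In each printed pair the two numbers agree up to floating-point rounding, since every group of digits covers exactly a factor of 2.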

Another perspective is that when we measure things, we're frequently observing something in the midst of an exponential growth. Exponential growth happens all the time in nature, for example, in the sizes of populations or SECRET OFFSHORE BANK ACCOUNTS with a fixed (compound) interest rate. The key feature of a quantity growing exponentially is that it has a fixed "doubling time." That is, the amount of time it takes to grow by a factor of 2 is independent of how big it is currently. For example, let's assume your illegal bank account (well not yours, but one's) doubles in value every year and starts off with a balance of $1000. At the end of year 1, you'd have $2000, at the end of year 2 you'd have $4000, at the end of year 3 you'd have $8000, and so on. So for the whole first year, your bank balance would start with the digit 1, but during the second year you would have some balances starting with 2 and some with 3. During the third year, you would have balances starting with 4, 5, 6, and 7. If we AUDITED your account at some randomly chosen time, we'd be just as likely to see a balance starting with 1 as a balance starting with 2 or 3, combined, and so on. In other words, we have the same "scale invariance" conditions as before, which lead us back to Benford's Law. The same would be true no matter how quickly the account grew; exponential growth sampled at a random time gives us a logarithmic distribution of digits.
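The audit thought experiment can be sketched in Python under a few assumptions of my own: the balance starts at $1000, the "random" audit times are replaced by an evenly spaced grid, and the grid covers exactly the time the balance takes to grow tenfold, so each leading digit gets one full turn. The function name and parameters are hypothetical.

```python
import math
from collections import Counter


def leading_digit_shares(growth_per_year: float, samples: int = 100_000) -> dict:
    """Share of audit times at which the balance starts with each digit.

    Assumes a $1000 starting balance and audit times spread evenly over
    exactly one tenfold growth cycle (from $1000 up to just under $10000).
    """
    years_per_tenfold = 1 / math.log10(growth_per_year)
    counts = Counter()
    for i in range(samples):
        t = years_per_tenfold * i / samples
        balance = 1000 * growth_per_year ** t
        # The balance stays in [1000, 10000), so its first digit is balance // 1000.
        counts[int(balance // 1000)] += 1
    return {d: counts[d] / samples for d in range(1, 10)}


# Compare an account that doubles yearly with one that grows 5% a year.
for growth in (2.0, 1.05):
    shares = leading_digit_shares(growth)
    print(growth, [round(shares[d], 3) for d in range(1, 10)])
```

Both growth rates should give nearly the same shares, close to log10(1+1/d) for each digit d.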

To give you a concrete example, I went through the first 100 powers of 2--1, 2, 4, 8, 16, ...**--and instructed my computer to keep track of just the first digits. The results, as you can see, conform pretty nicely to Benford's Law:
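The tally is easy to reproduce. The sketch below is not the original program, just one way to do the same count in Python, starting from 2 to the power 0 as in the list above; the function name is invented for illustration.

```python
import math
from collections import Counter


def leading_digit_counts(n_powers: int = 100) -> Counter:
    """Tally the first digits of 2**0, 2**1, ..., 2**(n_powers - 1)."""
    # Python integers are exact, so str() gives every digit of each power.
    return Counter(int(str(2 ** k)[0]) for k in range(n_powers))


counts = leading_digit_counts()
for d in range(1, 10):
    predicted = 100 * math.log10(1 + 1 / d)
    print(f"{d}: observed {counts[d]:3d}, Benford predicts {predicted:5.1f}")
```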

For whatever reason, it appears that Benford's Law, like TAX LAW, is the law.