Sunday, January 29, 2006

Earlier this week I swapped a couple of emails with someone at Google involved with looking at web page statistics on the Google database. They probably have one of the largest corpora of textual data in the world, and it's a mixture of data from a wide variety of sources. This makes it an ideal candidate for testing Benford's law, and that's what I'm suggesting they do. The published tests I've seen so far are pretty small, but it'd be fun to see the results from, say, a billion web pages. Coincidentally, Boing Boing just had a link to a New York Times story on it from 1998. Anyway... I asked the guy to email me if they carry out the test. It's probably pretty low down on their list of priorities, but if they do, of course I'll report back.
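For a sense of what such a test might involve, here is a minimal Python sketch that tallies the first significant digit of every number found in a piece of text. The function name `first_digit_counts` and the regular expression are my own assumptions for illustration, not anything Google is known to use, and a real web-scale test would need far more careful number extraction.

```python
import re
from collections import Counter

def first_digit_counts(text):
    """Tally the first nonzero digit of each number found in the text."""
    counts = Counter()
    # Assumed pattern: digits with optional thousands commas and a decimal part.
    for token in re.findall(r"\d[\d,]*(?:\.\d+)?", text):
        for ch in token:
            if ch in "123456789":
                counts[int(ch)] += 1
                break  # only the first significant digit matters
    return counts

sample = "There were 120 files, 0.034 errors and 1,500 users in 2006."
print(sorted(first_digit_counts(sample).items()))
```

On a large and varied enough corpus, the tallies could then be compared with the proportions Benford's law predicts.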

I'm not sure how anyone can say "nobody knows why" numbers starting with 1 are more common. I'm not a mathematician but it seems quite obvious to me.

As an example, there are only a few cities in the world with more than 10 million people, a handful with 5 million, and maybe 100 or 200 with 1 million. There are countless thousands of small communities with 1,000 people. As the numbers get bigger, the number of places with that number gets smaller.

So if we are talking six digits, more places will have 100,000-200,000 people than 200,000-300,000 people, and so on. This would be true of most things that require a number - smaller numbers are more plentiful.

Benford's Law does not seem to me to be all that complex to understand.

Assuming most numbers in this world are NOT random, that is to say, individually, they are part of a larger sequence, then 1 would be the most used number followed by 2 etc.

Taking addresses as an example, the address 239 XXX street is necessarily preceded by 238, 237 and so on. Crucially, the addresses all START at 1, as do most ordered things (imagine giving the first person in a lineup the number 78364), and thus there are more 1's than any other number used in our daily lives.

Benford's law solved. I now turn my attention to figuring out how ANYONE can like black licorice.

Most numbers out there are sequences, starting from 1 (why would you use random numbers?) There may be examples where numbers are more random (computer file sizes) but these will be heavily outweighed by the number of sequences out there.

Also, computer file sizes could be seen as something that increases sequentially. You write code, the file size gets larger. You add more code, it gets larger still. You have to go through file size 1000 before getting to 2000 before getting to 3000. You stop at some point when the file is complete. Overall more will have stopped at 1000+ than at 2000+ than at 3000+ etc. (because you have to go through these stages before reaching the larger sizes). Therefore there will be the logarithmic distribution exactly as shown.

Supermarket prices are the same. You have far more low priced items than high priced items. To become a high priced item you have to go through all the low prices first, therefore a price is more likely to start with a 1 than a 2 than a 3 etc.

The fact that you have to go through files of size 1000 to get to a file of size 2000 doesn't entail that there are more files of size 1000. For example, if I wrote a program to spew out millions of files of length 2000 I'd expect to see no files of length 1000 in my filesystem.

Supermarkets aren't forced to give products a low price which is then increased. New products come on the market all the time. Benford's law will apply to the set of all new products that have just come onto the market. And Benford's law would hold in market prices even if we were going through a period of deflation. Benford's law will hold for 1/price when there is inflation, ie. now.

None of what you say explains why as you grow the dataset the distribution gets closer and closer to precisely logarithmic as opposed to any other distribution that decreases with digit size.

'For example, if I wrote a program to spew out millions of files of length 2000 I'd expect to see no files of length 1000 in my filesystem.'

not really the best example is it? You are setting out to artificially make an exception that disproves the rule.

Yes, there may be cases where particular files are in a particular size range (for instance I deal with huge numbers of low res jpg images which will tend to a particular size of maybe 50000 - 80000k in size) but these are exceptions. The vast majority of files will grow in size with the amount of data they contain, and will vary enormously in size, but the large files will have had to have been small files before they became big files.

open a random folder on your computer and check the file sizes...

also, whenever a list of file sizes increases to the next 'power' (if that is the right term) it will automatically be tipped dramatically back towards confirming Benfords law

for instance, if the file sizes increase steadily, as soon as they go from 9999k to 10000k, 10 times as many files from then on will start with a 1 than started with any earlier number. It will start to even out as the number increases, but the same thing will happen when it hits 100000k etc

as lists of numbers can stop at any point within this progression, overall the balance will be towards the lower numbers (as you always have to have the lower numbers before the higher numbers)
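The swing described here can be checked with a short Python sketch. The helper name `leading_one_fraction` is made up for illustration; it counts, for a list that runs from 1 up to some stopping point n, what share of the numbers start with a 1.

```python
def leading_one_fraction(n):
    """Fraction of the integers 1..n whose first digit is 1."""
    ones = 0
    power = 1
    while power <= n:
        # Integers from power to 2*power - 1 start with a 1; clip the range at n.
        ones += min(n, 2 * power - 1) - power + 1
        power *= 10
    return ones / n

for n in (9999, 19999, 99999, 199999):
    print(n, round(leading_one_fraction(n), 4))
```

The share of leading 1's jumps from about 11% at 9999 to about 56% at 19999 and then falls back again, so where the list happens to stop matters a great deal.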

I suspect supermarket prices are probably not a good example anyway, because prices will often be set at 99.99 or 99.95 rather than tipping it over to a psychologically higher number.

However, supermarket prices, and other exceptions, will be overwhelmingly drowned out by most number sequences which do not have these artificial effects on their progression.

> You are setting out to artificially make an exception that disproves the rule.

Exactly. That's the whole point of a counterexample - to contrive a case that demonstrates that a rule doesn't hold.

Sequences don't imply a logarithmic distribution. If I toss a coin 1,000 times and count how many times I get heads I have a sequence. I'll start with one head, go to two, then three and so on. The final distribution is approximately Gaussian centred on 500, nothing like a logarithmic distribution.
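The coin-toss counterexample is easy to simulate. The following is only a sketch: the function name, the number of trials and the fixed seed are arbitrary choices of mine. It repeats the 1,000-toss experiment many times and tallies the first digit of each final heads count.

```python
import random
from collections import Counter

def heads_leading_digits(trials=2000, tosses=1000, seed=0):
    """Repeat a coin-toss experiment and tally the first digit of each heads count."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    digits = Counter()
    total_heads = 0
    for _ in range(trials):
        # Each of the `tosses` random bits is one fair toss; 1 means heads.
        heads = bin(rng.getrandbits(tosses)).count("1")
        total_heads += heads
        digits[int(str(heads)[0])] += 1
    return digits, total_heads / trials

digits, mean_heads = heads_leading_digits()
print(sorted(digits.items()), mean_heads)
```

Nearly every count lands in the 400s or 500s, so the first digits are almost all 4 or 5, even though each count was reached by passing through 1, 2, 3 and so on.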

Conversely, logarithmic distributions don't imply sequences. For example, the reciprocals of prices (ie. 1/price) in a supermarket also have a logarithmic distribution, and yet you can't argue that reciprocal prices form a sequence.

So there is no necessary connection between a logarithmic distribution and having a sequence.

If you search Benford's Law on, of all places, Google, you'll get an explanation of why the distribution is exactly the way it is. When I searched it was the first link, but such things change.

The basic principles discussed here are correct but they don't explain why in a large enough distribution there are about 30% 1's instead of, say, 25% 1's. The website I found (http://mathworld.wolfram.com/BenfordsLaw.html) uses calculus to develop a formula:

log10(1 + 1/D), where D is the first digit.
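To see the numbers this formula gives, here is a small Python sketch that evaluates it for each first digit. The function name `benford_probability` is just a label chosen for this example.

```python
import math

def benford_probability(d):
    """Expected share of numbers whose first digit is d, for d from 1 to 9."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(d, f"{benford_probability(d):.3f}")
```

The nine values add up to 1, and the value for D = 1 is about 0.301, which is where the figure of roughly 30% 1's comes from.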

Now I'm not saying that I understand how the proof works, as I'm a little rusty on such things and it wasn't very detailed, but I almost understand it, so any claims that nobody knows why the law works strike me as quite doubtful.

Also, it doesn't work with random numbers or small sets heavily influenced by psychology, it is a distribution that works better the larger and more diverse the sample, particularly with numbers that arise naturally. Counterexamples are irrelevant unless they fall on a scale where this is supposed to work.

> but I almost understand it, so any claims that nobody knows why the law works strike me as quite doubtful.

The problem is that the argument on the Mathworld web site isn't a valid proof for two reasons that the article itself points out: (1) P(x)=1/x cannot possibly be a probability distribution as its integral diverges (ie. the sum of probabilities is infinite, not one) and (2) Benford's law applies to mixtures of data from a variety of sources. The resolution of these issues only came in 1996.

> Counterexamples are irrelevant unless they fall on a scale where this is supposed to work.

I don't know what you mean by "scale" here. I could generate a trillion files, each a billion bytes long and there'd be no files less than a billion bytes long. So I presume you're not talking about either the size of the numbers or the number of numbers.

The most important reason is that numbers are commonly used as measurements of quantities. Think about this example, and try it in Excel if you like: take an amount of money as a number, say $1,000. Repeatedly increase it by 5% until you get to $10,000, then count the number of numbers that start with each digit. You will get 15 1's, 8 2's, 6 3's, 4 4's, 4 5's, 3 6's, 3 7's, 3 8's, and 2 9's.

In the range of amounts $1,000-$9,999 the numbers in the $1000-$2000 range are more prevalent because this is actually a bigger proportion in terms of the % changes. As a lot of economic numbers (the article is about the IRS, and tax returns, not phone numbers or other lists) change in a 'percentage' way not a fixed increase (price rises, wages, stocks, production, etc.) this distribution is inevitable.
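The 5% growth tally can also be reproduced outside a spreadsheet. Here is a minimal Python sketch, assuming the starting $1,000 is counted and the last value counted is the final one below $10,000; the function name `growth_first_digits` is my own.

```python
from collections import Counter

def growth_first_digits(start=1000.0, rate=0.05, limit=10000.0):
    """Grow an amount by a fixed percentage and tally first digits below the limit."""
    counts = Counter()
    amount = start
    while amount < limit:
        counts[int(str(int(amount))[0])] += 1  # first digit of the whole-dollar part
        amount *= 1 + rate
    return [counts[d] for d in range(1, 10)]

print(growth_first_digits())  # [15, 8, 6, 4, 4, 3, 3, 3, 2]
```

The amount spends 15 steps between $1,000 and $2,000 but only 2 steps between $9,000 and $10,000, because doubling from 1 to 2 takes far more 5% steps than the roughly 11% rise from 9 to 10.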

You describe a process that gives you a logarithmic distribution (the correct one!) for a similar reason to Benford's law, but it isn't *exactly* the same. For example, Benford's law will hold if you look at something like the densities of materials in a chemical data book. These aren't dynamic values that increment repeatedly by a percentage.

So random, yet so interesting. I am not a mathematician either, but I did stay at a Holiday Inn Express last night.

It seems to me that ultimately this could be explained by adding in the human element here -- statistics of human behavior and perception are very often observed to fit neatly into one or more power laws, many of which are logarithmic.

See Wikipedia, where "[power laws] appear to fit such disparate phenomena as the popularity of websites, the wealth of individuals, the popularity of given names, and the frequency of words in documents..."

"Author Philip Ball has argued that the same power law relationships that are evident in phase transitions also apply to various manifestations of collective human behaviour."

If I remember my statistical analysis of human behavior class correctly, perception of stimuli, including sound waves, is most often expressed logarithmically.

For instance, the perceived difference between a sound wave of 10db and a sound wave of 100db is actually the same as the difference between waves of 100,000db and 1,000,000db.

Because of this, the volume dials on most stereos are actually calibrated according to a logarithmic scale, e.g. an arbitrary volume setting of "2" on your speakers has an actual amplitude difference of 10x the volume of "1".

Here, we have simply adjusted the means by which we are measuring the sound -- the volume dial -- to best fit our perception.

In this way, we might assume that other systems by which we measure and record data also reflect this, including the decimal system of numbers, which is incidentally a base-10 system where the position of a numerical symbol expresses the value of that symbol in terms of the exponential values of the base.

I am not suggesting that a table of molecular weights or sizes of computer files are somehow 'biased' by our perception, just that the systems of measurement and numbers used to express data are really just human constructs adopted according to their utility.

This utility would in theory include the representation of the world according to logarithmic perception. I could be totally off base, but it seems to me that, like the volume dial, the decimal system is just another way that we as perceiving entities have devised to express the world around us.

There is some truth to what you say - partly for the obvious reason that we are talking about first digits, which is obviously an artifact of our number system. A culture that used a logarithmic system, on the other hand, might find that the equivalent of their first digit was uniformly distributed.

BTW One of my pet irks is that many mp3 playing applications don't mimic the logarithmic volume controls on real stereos meaning that the useful range of volume control is all compressed at one end.

PS How come everyone is suddenly stumbling on this year-old blog entry of mine?

Benford's Law is fascinating, and makes more sense as you examine it further.

For denotations of time, the leading numeral one is highly favored, since four of the 12 possibilities start with that number (1, 10, 11 & 12).

On military times, however, it's skewed toward zero, obviously since ten out of 24 will start there. In actuality, most usage of military times will probably front-load with normal duty hours, say between 0700 to 1700, fiddling it back a bit, since logbooks and such will have more entries during those hours, and more activities take place during those hours.

Smaller amounts are more common in our world. Why, you ask? Larger objects/amounts are harder to maintain and harder to come by, because of the amount of space required to contain them and the amount of energy required to handle them.