Monday, June 05, 2006

More AiG lying with statistics: John Woodmorappe and Noah's Ark

Today's sample of bad math is something I talked about way back in my talk.origin days in usenet; and which has also been discussed by lots of other people. I thought it was too old, too well covered to be worth taking time and space here, but I keep having people forward it to me, suggesting I post something on it here. So who am I to stand in the way of the desires of the readers of this blog?

Back in 1996, a guy publishing under the pseudonym of "John Woodmorappe" wrote a book called "Noah's Ark: A Feasibility Study." AIG has an article discussing the basic ideas of the book here, and it is this article that I'll be using as the focus of my post.

Woodmorappe's text purports to be a details study of whether or not it was technically feasible for the biblical story of the ark to be true. He tries to demonstrate the possibility of truth by purportedly showing how all of the animals could fit on the ark, and proposing mechanisms by which they could be fed, wastes could be disposed of, etc.

The AiG article focuses on that first issue: is it possible that the ark, as described in the bible, have enough room to fit all of the animals? They admit that the answer to this comes down to two questions:

How many animals needed to be on the ark?

How much space was needed for each animal?

They play a few games to try to reduce the number of animals that they need to make space for, ending up with 16000 individual animals. The reasoning to get that number of pretty questionable, but it's not worth quibbling over, because of what comes next.

Woodmorappe wants to figure out how much space is needed by 16000 animals. How does he do that? Easy. He figures out an average amount of space needed by each animal, and multiplies it by the number of animals. Doesn't sound bad as a way of making an estimate, right?

Except... How does he figure out how much space is needed by each animal?

He figures out the median size of the animals in the ark; and determines that the median-sized animal would need a space of .5 by .5 by .3 meters - that is less than 1/10th of a cubic meter per animal. Multiply that by the number of animals, and you come up with 1200 cubic meters. Not terribly much space at all - which would mean that there would be plenty of room for food for the animals.

So what's wrong?

There are two different figures in statistics that can be used as an 'average' value, the mean, and the median. The mean is what we normally mean by average: take all of the values, add them up, and divide by the number of values. The median is a way of selecting a "typical" individual to represent the average: take all of the values, lay them out in order, and pick the value in the middle.

Woodmorappe uses the median. The median represents a "typical" individual; it's the value in the center. The nice thing about the median is that it gives you an idea of what a typical individual is like, without having it shifted by outliers in your data. But it's useless for trying to extrapolate to things about the population - because the outliers are part of the population; you can't eliminate them from your calculations about the population.

The median size of my pets? Well, in order they were (10, 15, 18, 18, 20, 20, 78); so the median was 18 lbs.

The mean size of my pets? (10+15+18+18+20+20+78)/7 = 25.5lbs.

If I use the median to reason about the group of pets, I'd conclude that all of my pets together would weight 126 lbs. But that's not true: they would weigh 179lbs! I'd be off by 54 lbs! Because by using the median instead of the mean, I've ignored the outlier.

That's exactly what Woodmorappe does. He concludes that the median animal is the size of a rat, and a rat, 1/10th of a cubic meter is more than enough space. But he's very deliberately eliminated the effects of elephants, dinosaurs (which Woodmorappe claims were on the ark!), horses, cows, antelope. They're all outliers in his data. By using the median, he's been able to reduce the space requirement by a huge factor - by at least something between one and two orders of magnitude!

This isn't sloppiness. No one would reasonably even compute a median size for something like this. This is a deliberate attempt to lie with numbers: by taking advantage of the fact that most of his readers aren't going to notice that he uses the median rather than the mean for an average (or in fact by relying on the fact that most don't even know the difference), he's lying to them to help them reach the conclusion that they really want.

Quite simply, Woodmorappe deliberately makes a mathematical error, and then uses that error to lie.

Saturday, June 03, 2006

Site changes complete

Sorry for the burst of index posts for those of you reading via RSS aggregators. The site changes are done now; there's a menu in the sidebar linking to the topic indices, and I've added a couple of links to the (still too small) blogroll. I might still make a polish change or two to the site, but nothing that should trigger any new RSS updates. Please let me know if you happen to notice any problems with the indices or sidebar.

Reader poll about topics; and some site maintenance stuff

I'm going to be doing a bit of work on the site this weekend to try to create a couple of indices for some of the series of posts that I've written. (After three months, I'm starting to realize that the site could use a bit or organization :-) ) So there's going to be a flurry of activity around here that's not terribly interesting, but it will make things easier to find as a result.

I'm also trying to decide what subjects I should write about next. I've got a few possibilities in mind, and I thought I'd ask you folks what you're interested in. Let me know which of the following interest you most; or feel free to "write in" your own suggestions.

Topology. I'm far from an expert on the subject, but it's an interesting one, and I'm willing to study up and write about it, so long as people are patient with me when I inevitably make mistakes.

More group theory. I've got enough time to be able to continue the series of posts on group theory.

Modal logics. In the whole lambda calculus thing, I mentioned the idea of modal logics. There are some really fascinating modal logics (I particularly like temporal and linear logics) that allow you to reason in different ways.

Graph theory. Graph theory is a fascinating area of discrete math that think is a lot of fun.

Category theory. Category theory is similar to group theory; it's one of those fields based on abstracting something down to its bare essentials. In cat theory, it's the concept of a function as a mapping from one thing to another.

Friday, June 02, 2006

As you must have heard, this week, the US Department of Homeland Security released information about the allocation of anti-terrorism funds to US cities. New York City, which is about 15 miles south of where I'm sitting right now, had its funding cut by 40%. The mayor of NYC pointed out the stupidity of this, saying "Whenever we catch someone involved in a terrorist plot, they've got a map of NYC in their pocket".

More information about the allocation has gradually been getting out. It turns out that the allocation method was remarkably simple. In their applications for funding, cities listed assets that they needed to protect. What DHS did was take the number of listed assets from all of the cities that were going to be recipients of funds, and give each city an amount of funding proportional to the number of assets they listed.

So, the Empire State building is equal to the neighborhood bank in Omaha. The stock exchange on Wall Street is equal to the memorial park in Anchorage, Alaska. Mount Sinai hospital is equal to the county hospital in the suburbs of Toledo, Ohio. The New York subway system (18.5 billion passenger-miles per year) is equal to the Minneapolis transit system (283 million passenger-miles per year). The Brooklyn Bridge is equal the George Street bridge in New Brunswick, NJ.

We know, perfectly well, how to estimate relative values and risks - the insurance company does it every day. Did our government do that? No. They ignored the established math of how to do it right, in favor of a stupid, a mind-bogglingly stupid, a deliberately stupid formula that would screw over every single actual high-risk location in the country, so that they could shove the money at people who don't need it.

Finally, Modeling Lambda Calculus: Programs are Proofs!

Now that we've worked through intuitionistic logic and it's model; and we've worked from the untyped lambda calculus to the simply typed lambda calculus; we're finally ready for the model of lambda calculus. And this is where it gets really fun.

Think about the types in the simple typed language calculus. Anything which can be formed from the following grammar is a lambda calculus type:type ::= primitive | function | ( type )primitive ::= A | B | C | D | ...function ::= type -> type

The catch with that grammar is that you can create type expressions which, while they are valid type definitions, you can't write a single, complete, closed expression which will actually have that type. (A closed expression is one with no free variables.) When there is an expression that has a type, we say that the expression inhabits the type; and that the type is an inhabited type. If there is no expression that can inhabit a type, we say it's uninhabitable.

So what's the difference between inhabitable type, and an uninhabitable type?

The answer comes from something called the Curry-Howard isomorphism. For a typed lambda calculus, there is a corresponding intuitionistic logic; a type expression is inhabitable if and only if the type is a theorem in the corresponding logic.

Look at the type "A -> A". Now, instead of seeing "->" as the function type constructor, try looking at it as logical implication. "A implies A" is clearly a theorem of intuitionistic logic. So the type "A -> A" is inhabitable.

Now, look at "A -> B". That's not a theorem, unless there's some other context that proves it. As a function type, that's the type of a function which, without including any context of any kind, can take a parameter of type A, and return a value of a different type B. You can't do that - there's got to be some context which provides a value of type B - and to access the context, there's got to be something to allow the function to access its context: a free variable. Same thing in the logic and the lambda calculus: you need some kind of context to establish "A->B" as a theorem (in the logic) or as an inhabitable type (in the lambda calculus).

It gets better. If there is a closed LC expression whose type is a theorem in the corresponding intuitionistic logic, then the expression that has that type is a proof of the theorem. Each beta reduction is equivalent to an inference step in the logic. The logic corresponding to the lambda calculus is it's model. In some sense, the lambda calculus and intuitionistic logic are just different reflections of the same thing.

There are two ways of demonstrating the isomorphism: there's the original way that Curry did it, using the combinator calculus; and there's something called the sequent calculus. I learned it using the combinator version, so I'll run through that quickly. Another day, probably early next week, I'll do the sequent version.

Let's recall what a model is. A model is a way of showing that every statement in the calculus is valid in some concrete universe of values - that there is a correspondence between some set of real entities and the entities in the calculus, where statements in the calculus correspond to valid statements about the real entities. So we don't actually need to do the full isomorphism; we just need to show a homomorphism from calculus to logic. (An isomorphism means you can go both ways: from calculus to logic and from logic to calculus. Homomorphism is only from calculus to logic.)

So what we need to do is explain exactly how to take any complete lambda calculus expression, and translate it into a series of valid intuitionistic logic statements. Since the intuitionistic logic itself has been proven valid, if we can translate lambda calculus to IL, then we'll have proven the validity of lambda calculus - meaning that we'll have shown that computations in the lambda calculus are valid computations, and the lambda calculus is, indeed, a fully valid, effective computing system.

How do we get from combinators (which are just a shorthand for lambda calculus without variables) to intuitionistic logic? It's actually amazingly easy.

All proofs in an intuitionistic logic come down to a series of steps, each of which is an inference using one of two fundamental axioms:

The steps to the model should be pretty obvious now. Types in a lambda calculus correspond to types of atoms in an intuitionistic logic. Functions are inference rules. Each function can be reduced to a combinator expression; and each combinator expression is an instantiation of one of the fundamental inference rules of the intuitionistic logic. The function is a constructive proof of the theorem in the corresponding logic.

Now, is that cool, or what?

(To which any normal person would respond "what". But then, I'm obviously not a normal person, I'm a math geek.)

Friday Random Ten, 6/2

Yes indeed. it's friday again. Where did my week go?

Phish, "My Friend, My Friend". I actually really like certain jam bands. Phish is often quite good. This is one of my favorites of their studio tracks. (And, if you look at the cover of the album, and you draw a line through the arm of the guy on the bed, you can make a 23.5 degree angle! Phish is in on the secret!)

Porcupine Tree, "Prodigal". Pretty much what you expect from PT: somewhat odd dark chords, an unexpectedly smooth voice, kinda catchy melody. Cool stuff. Not their best, but still awfully good.

Sonic Youth, "Disappearer". Sonic Youth often do things with interesting tonalities, microtones, and strange extended jams involving playing distorted guitars with all sorts of peculiar implements. This is a perfect example of that.

King Crimson, "The Power to Believe 1: a Capella". I love the stuff that Fripp and company have been up to in the last couple of years. They've been getting even stranger with time. After the ProjeKCts work, they did this album, and it really shows the way that all that free-improv work affected them.

Miles Davis, "Deception". It's Miles, what more do you need to say?

Nightingale, "The Discharged Drummer". Nightingale is a wonderful French-Canadian acoustic folk trio, and one of the best contradance bands in the world. Beautiful little track, this one.

Broadside Electric, "With her head tucked underneath her arm". Broadside is a Philadelphia based electric folk band which plays a lot of stuff based on ballads from Childe's compilation. They also mix in traditional irish, klezmer, and progressive rock influences. The usual lineup is lead singer/guitarist/dulcimer, electric violin, chapman stick, tinwhistle/oboe, and drum. This is one of their humorous songs, about the ghost of Ann Boleyn haunting the tower of London. (Who doesn't love lyrics like " Along the draughty corridors for miles and miles she goes/She often catches cold, poor thing, it's cold there when it blows/And it's awfully awkward for the Queen to have to blow her nose/With her head tucked underneath her arm!") It's not one of their originals, but who else has recorded it with a klezmer violin solo?

Pat Donohue, "The Other End of the Mississippi River Blues". Pat is the guitarist from the Guys All-Star Shoe Band on Prairie Home Companion. He's the best finger-style guitarist that I've ever seen in person. He's also quite a goof: Pat's own description of what he does is "go out and find perfectly good songs, and then ruin them". This one is one of his originals. Silly, but fun. ("They've got magnolia trees, sweet southern breeze; I've got ice and snow, 26 degrees below".)

Wednesday, May 31, 2006

Improbable is not impossible, redux

I actually decided to say a bit more about the concept of improbable vs. impossible after my last post.

As I've said before, the fundamental idea behind these arguments is what I call the "big numbers" principle: We (meaning human beings) are very bad at really grasping the meaning of very large numbers. It's hard to really grasp what a million is: if you counted one number a second, with no breaks of any kind, it would take you more than 11 days to count to one million. When we start getting to numbers that dwarf a million, our ability to really understand what they mean - to really grasp on a deep, intuitive level what they mean - just completely goes out the window.

What does 10^20 mean? If it takes 11 days to count to one million, then it would take roughly 11 times 10^14 days - or 3 times 10^12 (aka 3 trillion) years to count to 10^20. That just doesn't mean anything on an intuitive level. So what does 10^100 mean?

What happens when we start to look at probability? Well, it's really easy to get huge numbers in probability calculations - the kinds of numbers that we can't really grasp. And because those numbers are so meaningless to us, so utterly beyond the reach of what we really comprehend, the probabilities of astonishingly common events can seem like they're entirely beyond the realm of possibility: they must be impossible, because how could anything so easily comprehensible create numbers so incomprehensible?

Who hasn't shuffled cards? Who hasn't played a game of klondike solitaire? And yet, every time we do that, we're witnessing an event with a probability of roughly 1 in 8 times 10^67!

If something as common as shuffling a deck of cards can produce a result with odds of 1 in 10^67, then think of the probability of something uncommon: what's the probability of the particular arrangement of molecules in a particle of dust? Or the shape of a comets tail? Those are all natural, possible events - and the probability of them happening the way that they do is pretty much beyond our capacity to imagine.

When you grasp that, then you start to realize just how meaningless the big numbers argument is. No matter how improbable something is, that does not mean that it is impossible. Impossible is a probability of 0. And any probability, no matter how small, is almost certainly the probability of something that's occurring in the universe, somewhere, right now.

Repeat after me: Improbable IS NOT Impossible

I got a request to respond to look at this. It's yet another one of those innumerate jackasses arguing that god must exist, because life is too unlikely otherwise.

It's the same old crap, but with one interesting twist - this guy tries to provide an argument for why anything less probable than a specific level of improbability is absolutely physically impossible. In his own words:

In a previous post, I presented a logical, factual argument for the existence of God. But one of my readers, Seanny McShawn, took issue with my syllogisms, saying that I was simply making an argument “from incredulity”.

When someone tries to win an argument based on simple probabilities, this is called an “argument from incredulity.” This is a logical fallacy. In other words, the sheer unlikeliness of a scenario does not preclude its possibility and cannot be proof against it.

But was I arguing from “incredulity”?

The answer is, pretty much, "yes".

Physicists estimate that in our universe there are 10^80 particles. Mathematicians say that the mathematical level of absolute impossibility is 1 chance in 10^50.

Not any competent mathematician.

However, the physical level of absolute impossibility is 1 chance in 10^80, and here’s why:

On the basic level, probability is defined by the ‘particle’ example: finding a specially marked particle among 500,000 particles is beating odds of 1 in 500,000. In a universe that has 10^80 individual particles, the most improbable scenario is finding a specially marked particle in the entire universe. Due to the size of our universe, it is impossible to have a more improbable set of odds than 1 chance in 10^80. Anything that is more improbable than the most improbable is by all standards absolutely impossible.

I have never seen this particle example nonsense. But nonsense is what it is. It is, in fact, astoundingly easy to demonstrate.

That's 286 random characters, produced by pounding my hands on my keyboard, and then deleting any punctuation characters (because I'm too lazy to figure out how many unshifted characters there are on my keyboard, so I stuck to numbers and letters.) The odds of generating that particular string are 36^286. That's quote a bit less likely than 10^80. But guess what? It's possible!

Or, just to annoy the fundies, go get yourself a deck of Tarot cards. (Hey, they're good to have. There are some damned neat poker variants that you play with a Tarot deck, not to mention some funky solitaires.) A tarot deck has 78 cards. Shuffle the deck. Lay the cards out in order. The odds of that ordering? Less than 1 in 10^115. But entirely possible.

What is the probability of a solar flare taking a specific shape? Think of the trillions of trillions of particles in a flare, and the unique shape of each one. The probability of those shapes? Quite a bit worse than 1 in 10^80.

This one in 10^80 stuff is pure nonsense. It's just another attempt to create an arbitrary threshold beyond which we can say something is impossible.

He then goes on to repeat yet another variant of the "odds of life based on the probability of a particular protein" gibberish. I've covered that plenty of times on this blog; for example, my analysis of Berlinski's version of this argument here.

So this guy concludes with:

This is not an argument from incredulity. This is an argument from facts: cold, hard facts. Since any set of odds above 1 in 10^80 is absolutely impossible, random chance could not and did not produce life.

Except that it is an argument from incredulity, utterly lacking in cold, hard facts. In fact, it's trivially refuted by cold hard facts; and it's based on one of the most pathetic attempts I've seen so far to define "impossible" in terms of probability.

Magic 23.5 update: Osborne runs away.

Just thought I'd put up a quick little note here. I found that the author of the "23.5" stuff, after giving up on responding here, has set up a nice little website - sans comments - in which he defends himself from the criticisms here.

Since I went to the trouble of publically presenting his defense on the front page of this blog, and engaging in a civil discussion with him in the comments, I'm more than a little bit ticked at this. He's obviously welcome to handle things as he wants, but I find it rather cowardly of Gary to silently drop out of a public, open discussion here, and then continue it on a private site, where no one who disagrees with him is permitted to respond.

I'm not going to link to it, since he's got rather a lot of advertisements set up (you actually have to sit through an ad on the main link to the site before you're allowed to read any content), and I'd rather not send cash in his direction as a reward for this. If you really care, you can find it either through google, or through the sitemeter referrals link on the bottom of the front page of this blog.

Practical Applications of Good Math: Type Checking in Programming Languages

(Since I see that there are still links pointing at this post, I'll point out here that this blog has moved to scienceblogs. This discussion was continued in a post there.)

I thought today I'd go a bit off topic, and rant a little bit about stuff that's been bugging me lately in my real work. It does, in its bizzare way, relate to the lambda calculus series I've been writing lately.

What I do professionally is research in a field called collaborative software development: how to build tools that help groups of people work together to build large systems. It's an interesting thing to do, with a huge range of things that can be done, ranging from better programming languages to better configuration management systems to better programming environments to entirely new kinds of collaborative tools. It involves practical programming, math, psychology and human factors, and a few other interesting things.

My approach to this kind of work is to view the key problem as being one of communication. Most problems in large systems are ultimately not algorithmic errors - they're errors where some key piece of information was not correctly communicated between developers. For example, one of the most common bugs in large systems occurs when the implementor of a piece of code makes assumptions about how their code is going to be called; and the developer who uses that code makes different assumptions. To be concrete; I experienced this recently on my project at work. I'm working on a bit of code, and one function I call takes a lot of parameters, each of which is an array of objects. Most of the time, you only supply a real value for one or two out of a list of 8 parameters. I assumed that for unused parameters, it made sense to pass null - that's how I would have written that method. The person who wrote it expected callers to never pass null, but to use empty arrays for unused parameters. Program compiles, deploys, and boom: exceptions out the wazoo. (There was a dynamic check for null, but because of some proxy stuff we do, it was very hard to track it down.)

This isn't a result of incompetence on anyone's part. This project has some of the finest developers that I've ever met working on it. The person who wrote that method with the awful parameter list is actually a pretty great programmer - one of the top people from one of the top groups of developers that I know of. I don't think I'm incompetent either :-). So why did the error happen?

Because a key piece of information didn't get passed between us. Between the size of our system, the number of people involved, the way the code is partitioned into components, and where the documentation lives, the information about what to do about empty parameters got lost.

What does any of this have to do with lambda calculus? Types.

That problem wouldn't have happened if we were programming in my favorite programming language, OCaml. In OCaml, to be able to pass a "null" value, I actually need to specify that in the parameter declaration (as an Option type). There's no way I can pass a null to a function that wasn't expecting one.

It's a common assertion among a lot of the elite hackers out there that "real programmers don't use strongly typed languages", or the Paul Graham line about strongly typed languages. For example, here's Paul on Java:

It's designed for large organizations. Large organizations have different aims from hackers. They want languages that are (believed to be) suitable for use by large teams of mediocre programmers-- languages with features that, like the speed limiters in U-Haul trucks, prevent fools from doing too much damage.

This attitude really bugs me. The thing is, types are meta-information about a program - and useful meta-information. And the more people you have involved in a system, the more important it becomes to have strong, formal information about parts of the system that allow automatic checks.

The elitism issue. "Yeah, I'm a super-duper programmer who doesn't make mistakes; people who don't work the way I do are just inferior." Yeah, tell that to the guys who wrote OCaml.

The cluelessness issue: "I write programs in Perl/Python/CommonLisp/... with three other guys who are all lisp wizards, and if we don't have a problem, then no one should, so that typing stuff is just nonsense to cover up for incompetence." There are problems that emerge as the size of a system and the number of people working on it increase. Three people hacking on something is very different from 30 people hacking on it, which is very different from 300 people. As the number of people involved increases, you need to do more to ensure that information gets to where it's needed.

The information loss issue. This is the one that actually bugs me the most. Programmers know what types they're using. They know what types they're passing. They know information about the parameters they're taking, and the parameters that they're passing. Good type system give them a way to write that down - and not just write it, but make it a part of the code, so that it can be tested. Writing it down as a part of the code means that the information is recorded; that it's maintained; and that it's checked. Not declaring types saves you a bit of keyboard time - but it means that there's important information in your head which you're not writing down. (I have exactly the same rant about parallelizing compilers, which is that you have to structure your code very carefully and precisely to allow the compiler to figure out the information that you knew before you wrote the code; but instead of letting you write it down, you have to trick the compiler into figuring it out on its own.)

The most common response from people like Paul to rants like mine is "Strong testing, not strong typing." The idea being, type checking is really something like a very simple kind of compile time testing, which is inferior to a real, complete test suite.

Sure, strong typing in a language is no replacement for testing. But why on earth should I write tests to verify correct typing, when I can just put the type information in line in the code and have it tested automatically? Typing and type checking is one part of the process; having a typed language doesn't mean you shouldn't write tests; writing tests doesn't mean that compile-time type checking is useless. Pulling the typing information out of line into a test is, frankly, just dumb: the more you spread the information around between multiple places, the less likely it is to be maintained, or even found.

I'm a huge unit testing fan; but one of the major limitations of testing is information scattering. You're putting information about the invariants and requirements of your code in the logic of the tests - and because of the multidimensional nature of software systems, that inevitably means that some of the information is here, some there. Trying to assemble the information you need from a list of test failures can be incredibly frustrating; and those tests often don't get maintained, so things change, but the tests aren't changed to thoroughly respect the new semantics. Compile time checking can't get out of synch with the current state of the code.