Sunday, January 14, 2007

One common complaint that I've come across relating to evolutionary biology is that it is unfalsifiable. This is technically correct. The broad statement "stuff evolves from other stuff" cannot be conclusively falsified. Fortunately, Real Scientists don't leave it at that - they actually go into detail on how evolution occurred.

One example of how this extra specificity can render evolution falsifiable is given by the notion of common descent. Common descent is not falsifiable in the general case - you can never be sure that any new species can't be fitted somewhere on the family tree. However, one prediction that can be derived by considering this concept in specific cases is: the phylogenetic tree of any set of organisms is fixed. Whichever gene you use to draw it, you should get the same tree.

Say we choose four organisms, then consider ten genes that occur in all of them. For each of these genes we can draw a graph or network of how the organisms appear to relate (more on how to do this later). If those graphs are not equivalent, that indicates that the family tree is not the same for each gene. This would falsify common descent.

With this approach in mind, I present a step-by-step HowTo guide to falsifying evolution.

Step 1: Pick two or more genes

It turns out that it's generally easier to pick a couple of genes first and see which species their sequences are available for than to pick a couple of species and try to identify genes they have in common.

If you're after a list of genes to consider, why not spend some time sticking random searches in the NCBI's Entrez Gene database frontend?

For this example, I have chosen the genes HoxA5 and HoxB5. No particular reason; I'd just heard hox genes referred to before.

Step 2: Pick four or more species

If you pick fewer than four species, it is impossible for the graphs to be distinct - three points can only be connected up in one way. With four species, there are three ways to pair them off (see below). More species would be interesting, but let's keep it simple.
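To put numbers on "more species would be interesting", here is a minimal sketch that counts the possible graphs. It assumes every internal junction joins exactly three branches (the standard fully branching case), in which case the count for n species is the product of the odd numbers up to 2n - 5. The function name is my own.

```python
def count_unrooted_topologies(n_species: int) -> int:
    """Number of distinct unrooted, fully branching trees on n labelled species."""
    if n_species < 3:
        return 1  # one or two species can only be joined up one way
    count = 1
    # Double factorial (2n - 5)!!: multiply the odd numbers 1, 3, ..., 2n - 5.
    for odd in range(1, 2 * n_species - 4, 2):
        count *= odd
    return count


for n in (3, 4, 5, 6):
    print(n, "species:", count_unrooted_topologies(n), "possible graph(s)")
```

Three species give one graph and four give three, so four is the smallest set where two genes could possibly disagree.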

To confirm that your genes are available for the species you're interested in, go to the NCBI's HomoloGene site and type your gene's name into the search bar.

For this example, I'll be considering Homo sapiens (humans), Pan troglodytes (chimpanzees), Mus musculus (mice) and Rattus norvegicus (rats). This is because these species are all available for the genes I'm interested in.

Step 3: Determine the distance between the species' genes

For each gene, and for each pair of species, you need to determine the extent of the difference between the exact forms of the gene in each species. This can be thought of as a measure of distance - two variants of the same gene will be separated by a given number of mutations.
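If you'd rather see what a distance of this kind looks like in code, here is a rough sketch of the crudest possible version: the fraction of positions at which two already-aligned sequences differ. The function name and the toy sequences are made up for illustration, and the NCBI's own scores may well be computed with a more sophisticated model.

```python
def mismatch_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip columns where either sequence has an alignment gap ("-").
    compared = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not compared:
        raise ValueError("no comparable positions")
    differences = sum(1 for a, b in compared if a != b)
    return differences / len(compared)


# Toy sequences, not real gene data: one mismatch in eight positions.
print(mismatch_distance("ACGTACGT", "ACGTTCGT"))
```

Run this for every pair of species and you have the same shape of table that Step 3 asks for.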

The easiest way to do this is to cheat and use NCBI functionality again. In the case of HoxA5, for example, I would go to its HomoloGene page and click on the link titled "Show table of pairwise scores". This brings up a table of alignment scores - we're interested in the "d" (distance) values.

Step 4: Create a phylogenetic graph for each gene

We now have enough information to use a technique called the Nearest Neighbour algorithm to derive a phylogenetic graph. A phylogenetic graph is basically a family tree of the species, except without any indication of where the last common ancestor fits into the picture. Such graphs look something like this, only without the "X marks the spot" in the middle.

The reason for using a phylogenetic graph rather than a tree is that it is actually impossible to figure out which part of the tree is the "root" from theoretical techniques alone. An unrooted tree (a phylogenetic graph) is therefore used instead. These are quite easy to generate from bioinformatic data. The easiest algorithm, and the one we're using here, is called the Nearest Neighbour algorithm. There are more reliable algorithms, but this one is handy because we only need to worry about the distance between gene variants rather than their actual encoded content.
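For exactly four species you can also work out the graph by hand. The sketch below is not the algorithm named above; it uses a simpler stand-in usually called the four-point method: of the three ways to split four species into two pairs, pick the one with the smallest summed within-pair distance. The function name and the distance values are invented for illustration and are not HomoloGene's numbers.

```python
from typing import Dict, FrozenSet, Tuple

Split = FrozenSet[FrozenSet[str]]


def quartet_split(
    dist: Dict[FrozenSet[str], float],
    species: Tuple[str, str, str, str],
) -> Split:
    """Return the pairing of four species that the distances support best."""
    a, b, c, d = species
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(pairing):
        (p, q), (r, s) = pairing
        return dist[frozenset((p, q))] + dist[frozenset((r, s))]

    best = min(pairings, key=cost)
    return frozenset(frozenset(pair) for pair in best)


# Invented distances for illustration only.
example = {
    frozenset(("human", "chimp")): 0.01,
    frozenset(("mouse", "rat")): 0.05,
    frozenset(("human", "mouse")): 0.15,
    frozenset(("human", "rat")): 0.16,
    frozenset(("chimp", "mouse")): 0.15,
    frozenset(("chimp", "rat")): 0.16,
}
print(quartet_split(example, ("human", "chimp", "mouse", "rat")))
```

With these made-up numbers the winning pairing is human with chimp and mouse with rat. The result is stored as a set of sets so that the order of species never matters when comparing graphs later.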

To produce the graph, you can use an online tool like this one - just fill in the distance data from HomoloGene. This particular tool produces a number of trees, which all represent the same graph with differently-positioned roots. You'll need to figure out what the original graph looks like. For example, this tree: [image] is structurally equivalent to this graph: [image] (note: 1 is human, 2 is chimp, 3 is mouse, 4 is rat)
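Stripping the root off by eye is easy enough with four species, but it can be sketched in code too. The version below assumes a rooted tree is written as nested tuples of species numbers (my own notation, not the tool's output format) and finds the pair of species that sit together on their own branch; that pair and the remaining two define the graph.

```python
def leaves(tree):
    """Set of species found anywhere under this node."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def unrooted_split(tree):
    """Reduce a rooted four-species tree to its unrooted two-versus-two split."""
    all_leaves = leaves(tree)
    if len(all_leaves) != 4:
        raise ValueError("this sketch only handles exactly four species")
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            clade = leaves(node)
            if len(clade) == 2:
                # A two-species branch fixes the graph: them versus the rest.
                return frozenset([clade, all_leaves - clade])
            stack.extend(node)
    raise ValueError("no two-species grouping found")


# Three differently rooted trees (1 human, 2 chimp, 3 mouse, 4 rat).
rooted_trees = [((1, 2), (3, 4)), (((1, 2), 3), 4), ((1, (3, 4)), 2)]
print(len({unrooted_split(t) for t in rooted_trees}))  # 1: all the same graph
```

All three example trees collapse to the same split, which is the sense in which they are one graph with the root in different places.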

Step 5: Compare the phylogenetic graphs

If you find, as I did, that the two genes you have chosen give phylogenetic graphs that are precisely equivalent, tough luck: you've failed to disprove evolution. In fact, you've actually reinforced it slightly - in the absence of some variant of evolutionary biology, there is absolutely no reason to expect this law of phylogenetic graph equivalence to hold.

It is certainly not characteristic of any designed system. If you ran this test on computer code, on engineering designs or on literature, you'd continually find situations where it didn't hold - chimaeras that built on many existing traditions rather than just the one. If the living world were indeed designed by an intelligent, purposeful entity then He must have gone out of His way to give the impression that evolution was responsible.

If you find that the two genes give phylogenetic graphs that are not equivalent, congratulations: you may have falsified evolution. If you've pulled off this trick, I would ask that you list the genes you used in the comments section of this post, so I can confirm your results. If there are no mistakes in your working, we can try some more accurate phylogenetic graphing algorithms, and if the results are still positive then quite frankly you're looking at a Nobel Prize here.
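The comparison itself boils down to an equality check, sketched below. It assumes each gene's graph has already been reduced to a two-versus-two split of the four species; the helper names are mine, and the gene labels and splits are placeholders rather than anybody's actual results.

```python
def split(pair_one, pair_two):
    """Order-independent representation of a four-species graph."""
    return frozenset([frozenset(pair_one), frozenset(pair_two)])


def graphs_agree(splits_by_gene):
    """True if every gene produced the same graph."""
    return len(set(splits_by_gene.values())) <= 1


# Placeholder data for illustration only.
consistent = {
    "gene_1": split(("human", "chimp"), ("mouse", "rat")),
    "gene_2": split(("rat", "mouse"), ("chimp", "human")),
}
conflicting = {
    "gene_1": split(("human", "chimp"), ("mouse", "rat")),
    "gene_2": split(("human", "mouse"), ("chimp", "rat")),
}
print(graphs_agree(consistent))   # True: tough luck
print(graphs_agree(conflicting))  # False: check your working first
```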

Microsoft SQL Server Business Intelligence Development Studio is an interesting product. The basic idea of it is simple: most companies have lots of databases floating around with information that's potentially useful to those in power. Hence, why not make it easy to produce lots of reports capable of digesting that information into pretty charts?

The idea is good. The implementation is atrocious. I'm currently stuck in the midst of one particular conundrum, which I wish to share.

BIDS uses the same interface as MS Visual Basic - it's a grid that you can position block elements such as charts and textboxes on. Whilst this is extremely effective for small reports, it's bloody awful for large ones - if you want to make a change to the size of the top element you have to select every other element of the report and shift them all a bit. Which you can't do. For reasons known only to themselves, Microsoft have made it very difficult to select more than a screen's worth of elements at any given time.

Of course, there are workarounds. The best one by far is to create a subreport - a separate report that can be embedded into the main report. This is represented in the main report by a fairly small block element, so can be easily moved around. A nested set of subreports can be used to create a fairly elegant layout. So far so good - why am I complaining?

The problem with this approach relates to the means by which data is imported into BIDS. Each report actually has two parts - a backend consisting of one or more SQL queries ("datasets"), and a frontend consisting of pretty charts etc. - and both are stored inside the report itself. The upshot of this should be obvious: if you want to use the same dataset in more than one report, the only way to do it is to include a copy of the same SQL code in each report*. This is unmaintainable - with enough reports, you'll end up with multiple versions of the same SQL, all of which produce subtly different results. So much for the elegance of subreports.

I'm particularly peeved about this problem because it is so completely unnecessary. Simply by introducing the backend and frontend as separate objects that could be linked as appropriate, this quite major problem could have been avoided. But it's fairly clear that the developers of the system never thought of this - they just took the existing concept (individual reports à la Crystal Reports) and built a system that could create a bunch of them in parallel. This product is fundamentally not designed to produce suites of reports.

There is one positive upshot, though. Next time someone complains to me that Microsoft's offerings are so much more ready for primetime than Open Source stuff, I'll have a really good counterexample handy...

* There is another way, which is to use the SQL to create a View (basically a dataset) embedded in the database itself. However, most non-programmers can expect to receive write access to the database schema shortly after the mercury freezes in Satan's thermostat.
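For anyone who does get schema access, here is a rough sketch of the view approach. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table, view and function names are all hypothetical; the point is only that the query is written once, in the database, and each "report" selects from it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('North', 100.0), ('North', 50.0), ('South', 75.0);

    -- The shared dataset lives in the database, not in any one report.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region;
    """
)


def report_totals(connection):
    """First hypothetical report: every region with its total."""
    query = "SELECT region, total FROM sales_by_region ORDER BY region"
    return connection.execute(query).fetchall()


def report_top_region(connection):
    """Second hypothetical report: reuses the same view, no copied SQL."""
    query = "SELECT region FROM sales_by_region ORDER BY total DESC LIMIT 1"
    return connection.execute(query).fetchone()[0]


print(report_totals(conn))
print(report_top_region(conn))
```

Change the view's definition and both reports pick up the change, which is exactly what copy-and-pasted datasets can't do.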