Wednesday, September 11, 2013

Public availability of phylogenetic data

I have previously noted the frequent failure of phylogeneticists to make their data publicly available (Releasing phylogenetic data). Recently, a paper appeared in PLoS Biology providing some quantitative data regarding this issue:

While constructing a super-tree of life, Drew et al. noted that of their 7,500 papers (appearing in 2000–2012) the published data (e.g., alignment and tree) had been deposited in a public repository in only one-sixth of the cases, and were available on request from the original authors for a further one-sixth, leaving two-thirds of the data unavailable.

Not unexpectedly, they suggest that the journals publishing these papers might play a role in addressing this issue:

Our findings indicate that while some journals (e.g., Evolution, Nature, PLOS Biology, Systematic Biology) currently require nucleotide sequence alignments, associated tree files, and other relevant data to be deposited in public repositories, most journals do not have these requirements.

Notably absent from the list of journals with such requirements are high-profile phylogenetic ones such as Molecular Biology and Evolution and Molecular Phylogenetics and Evolution.

Sadly, the role of journals has been presented in a rather poor light by some bloggers. For example, Roli Roberts notes:

And it's clear that journals are indeed spectacularly well-placed to police and incentivise the deposition, tracking, accessibility, and permanence of data associated with the papers that they publish. At the point of acceptance we have the authors over a barrel, and are in a great position to mandate deposition of all data for every paper.

This attitude has been criticized by other bloggers. For example, Rod Page notes:

In my opinion, as soon as you start demanding people do something you've lost the argument, and you're relying on power ("you don't get to publish with us unless you do x"). This is also lazy. I have argued that this is the wrong approach: when building shared resources carrots are better than sticks ... So, my challenge to the phylogenetics community is to stop resorting to bullying people, and ask instead how you could make it a no brainer for people to share their trees.

However, I ask a different question:

Why are phylogeneticists so reluctant to present their actual data in the first place?

After all, without data, science is merely opinion, and you don't need to be a scientist to have an opinion. (Even theoretical science ultimately concerns itself with data, so data really is the essence of science.) One does not have to be sceptical about a dataset in order to think that it should be publicly and freely available.

So, why is telling phylogeneticists to act like scientists "resorting to bullying people"? Why do we have to "inspire [people] to contribute" by offering them carrots? It seems to me that we have lost the argument that phylogenetics is science if the phylogeneticists won't behave like scientists.

Note that the alignment is the key thing in phylogenetics, not the derived tree. In one sense, a tree just makes a figure out of a table. So, given the published description of the tree-building method, it should be straightforward to reproduce the tree from the alignment. Indeed, if the tree cannot be reproduced from the alignment then there is serious cause for concern.
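To make the point concrete, here is a minimal sketch of rebuilding a tree from nothing but an alignment. It assumes a toy four-taxon alignment (the sequences and the names A to D are invented for illustration), simple p-distances, and UPGMA clustering with a fixed tie-breaking rule. UPGMA is used only because it is short to write down; it is not the method of any particular paper, and a real study would use whatever method its authors describe.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(alignment):
    """Return a Newick topology (no branch lengths) built by UPGMA.

    `alignment` maps taxon name -> aligned sequence. Ties are broken by
    sorted cluster names, so the result does not depend on input order.
    """
    sizes = {name: 1 for name in alignment}  # cluster label -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }
    while len(sizes) > 1:
        # Pick the closest pair of clusters, deterministically.
        pair = min(dist, key=lambda p: (dist[p], sorted(p)))
        a, b = sorted(pair)
        merged = f"({a},{b})"
        new_size = sizes[a] + sizes[b]
        # Distance to every other cluster is the size-weighted average.
        for c in list(sizes):
            if c in (a, b):
                continue
            d_ac = dist[frozenset((a, c))]
            d_bc = dist[frozenset((b, c))]
            dist[frozenset((merged, c))] = (
                d_ac * sizes[a] + d_bc * sizes[b]
            ) / new_size
        # Drop all entries that refer to the two clusters just merged.
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes)) + ";"


# Hypothetical toy alignment, not taken from any published dataset.
toy_alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGAAT",
    "D": "TCGAACGAGT",
}

print(upgma(toy_alignment))
```

Anyone holding the same alignment and the same description of the method gets the same topology, whatever order the sequences are supplied in. Without the alignment, there is nothing to check the published figure against.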

In this sense, databases like TreeBASE might be missing the point somewhat. Where does one put the alignment if one is not interested in also storing the tree? Where does one put a network, if that is what one has instead of a tree? One could use Dryad, but they are now insisting on payment for storing scientific data, and for those of us without financial support this is no longer a realistic option.

Problems with data availability are not unique to phylogenetics, of course. Dani Zamir has recently noted:

In crop genetics and breeding research, phenotypic data are collected for each plant genotype, often in multiple locations and field conditions, in search of the genomic regions that confer improved traits. But what is happening to all of these phenotypic data? Currently, virtually none of the data generated from the hundreds of phenotypic studies conducted each year are being made publically available as raw data; thus there is little we can learn from past experience when making decisions about how to breed better crops for the future.

Nevertheless, in biology, there are databases for many things, such as gene sequences (GenBank), protein structures (PDB), and gene ontology (GO), and these are all used to one extent or another. Perhaps the most direct parallel to the problems with phylogenetic datasets is that of ecological datasets, as recently discussed in a PLoS ONE article.

It is interesting to ponder why this is such a problem in the biological sciences when it is apparently not so in the physical sciences. There are databases in astronomy, and databases of chemical properties in chemistry, for example, but otherwise it is generally the ability to get the same data by repeating the experiment that is the important thing in the physical sciences. In most cases a database would be not only redundant but also self-defeating (storing the data would imply that the data are not repeatable!).

So, this appears to be yet another by-product of dealing with biodiversity — data are incredibly variable in many areas of biology, and so it is necessary to store them for posterity because they are unique.

2 comments:

This is sad, but it can be worse. In the field of linguistics, where I work, for example, supplementary material is often discouraged by the publishers. In far too many cases, I had a hard time trying to get the supplementary data for certain publications posted online. Recently, I turned to the Dryad repository, because those responsible for one article were simply not placing the data online. The Dryad team asked me why I would turn to them with linguistic data, but when I explained, they agreed. It's very, very sad. In most cases, it is not possible for us to replicate analyses. And even in cases where supplementaries are given, journals seem to prefer PDF files. I know many cases where colleagues were complaining that they had to dig the numbers out of the PDF in order to apply further testing to proposed results. It's just a shame. Glad you brought this issue up here. It is so important that we keep the data that was produced alive for further testing. Sad to hear that biology has similar issues (although less striking than in linguistics). All I think we can do at the moment is to be completely strict regarding the submission of supplementaries on each study we propose...

Yes, this is certainly my experience reading papers in linguistics, that the data are considered to be ephemeral, and thus not worth preserving. It was also my experience as a student in ecology, where data were very hard to obtain (probably fewer than 10% of requests succeeded). The change of attitude that has occurred in ecology over the past couple of decades should give us optimism for linguistics, however. In the meantime, I agree that persistence is needed. I have also spent time extracting information from PDF files, which has taught me just how many ways there are of formatting PDFs: one has to doggedly try different programs for reading the files, including web browsers. This is ridiculous, given that the data were presumably in a spreadsheet in the first place!