Wednesday, July 20, 2016

All the lonely sequences. Where do they all come from?

Comparative phylogeography across a large number of species allows investigating community-level processes at regional and continental scales. An effective approach to such studies would involve automatic retrieval of georeferenced sequence data from nucleotide databases (a first step towards an ‘automated phylogeography’).

The study began with about 1.1M accessions representing over 20,000 species, but these impressive numbers shrank quite rapidly. Not unexpectedly, only 6.2% (some 70,000) of the retrieved GenBank submissions actually reported geographical coordinates and, even more concerning, the colleagues did not note any increase in recent years. The team also tried to increase this number by developing scripts that assign geographical coordinates from textual context (e.g. keywords in the publication, country information and so on). This geocoding raised the share of georeferenced accessions to 15.1%.
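To make the counting step concrete, here is a minimal sketch of how one might check whether a record carries usable coordinates. It assumes the coordinates arrive as the text of the INSDC /lat_lon qualifier (decimal degrees followed by hemisphere letters, e.g. "47.94 N 28.12 W"). The function names parse_lat_lon and georeferenced_share are made up for this illustration; this is not the script used in the study.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed input format: INSDC /lat_lon text such as "47.94 N 28.12 W".
LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$"
)


def parse_lat_lon(value: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) in signed decimal degrees, or None."""
    match = LAT_LON.match(value)
    if match is None:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    # Reject values outside the valid coordinate range.
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon


def georeferenced_share(values: Iterable[Optional[str]]) -> float:
    """Fraction of records whose lat_lon text parses to valid coordinates."""
    records = list(values)
    if not records:
        return 0.0
    hits = sum(
        1 for v in records if v is not None and parse_lat_lon(v) is not None
    )
    return hits / len(records)


if __name__ == "__main__":
    sample = ["47.94 N 28.12 W", None, "", "Madagascar"]
    print(parse_lat_lon(sample[0]))
    print(georeferenced_share(sample))
```

Records with missing or free-text locality fall through as None here; those are the cases a geocoding step would have to pick up from country names or other textual context.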

What I find most remarkable is that BOLD accessions, which represent only 3.4% of the analysed data, contributed a large portion (20.2%) of all georeferenced sequences (including geocoded accessions), and about half (47.3%) of the originally georeferenced accessions. This is not surprising, as the DNA barcoding community naturally sees the value in sharing this information and BOLD supports it as part of its metadata package. Actually, it is at least partially enforced, as it is not possible to generate records on BOLD without basic information on the country of origin. The same requirement is part of the to-dos for obtaining the BARCODE keyword for a GenBank record. Furthermore, researchers are always encouraged to provide lat/lon information to BOLD. Interestingly, the tetrapod barcoding dataset is likely rather small in comparison with other datasets, e.g. fish or arthropods. A similar analysis of the latter should yield even higher proportions, because the number of fully georeferenced insect records on BOLD should reach 4 million.

The authors try to answer the question of why so many data are submitted without detailed georeferencing, and they come up with three possible reasons:

(1) genuine lack of precise geographical information;

(2) unwillingness to reveal sensitive data (e.g. for samples from threatened species or populations);

(3) lack of interest and awareness about the potential importance of direct georeferencing of data deposited in nucleotide databases for large-scale reanalysis of sequence data.

I am afraid that (3) accounts for most cases. The vast majority of modern field collections use GPS data, and the percentage of sensitive data is very small. Even for sensitive samples, GPS data are often still used, but the precise location is masked by manually decreasing the GPS precision (cutting off some decimals or seconds/minutes does the trick). It is far too easy to just submit sequences to the INSDC databases with the minimum required metadata. Only community efforts with agreed-upon standards (such as barcoding campaigns and projects, e.g. iBOL) can lead to an improvement, unless GenBank and Co. change their rules.
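The masking trick mentioned above is simple enough to sketch in a few lines. The helper below (truncate_coordinate is a name chosen for this illustration, not an existing tool) cuts a decimal-degree value down to a given number of decimals. As a rough guide, one decimal of latitude corresponds to about 11 km, so the truncated point still supports regional analyses without giving away the exact site.

```python
import math


def truncate_coordinate(value: float, decimals: int = 1) -> float:
    """Cut a decimal-degree coordinate to `decimals` places (no rounding)."""
    factor = 10 ** decimals
    # math.trunc drops the fractional part towards zero, so positive and
    # negative coordinates are both shortened rather than rounded.
    return math.trunc(value * factor) / factor


def mask_location(lat: float, lon: float, decimals: int = 1) -> tuple:
    """Return a (lat, lon) pair with reduced precision."""
    return truncate_coordinate(lat, decimals), truncate_coordinate(lon, decimals)


if __name__ == "__main__":
    # Hypothetical sampling site, shown only to illustrate the masking.
    print(mask_location(43.53287, -80.22634))
    print(mask_location(43.53287, -80.22634, decimals=2))
```

Truncation is used here instead of rounding because it mirrors the manual "cutting off some decimals" approach; either would serve the purpose of coarsening the location.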

Although geocoding offers a partial solution to the scarcity of direct georeferencing, the amount of data potentially useful for automated phylogeography is still limited. Strong underrepresentation of hard-to-access areas suggests that sampling logistics represent a main hindrance to global data availability. We propose that, besides enhancing georeferencing of genetic data, future research agendas should focus on collaborative efforts to sample genetic diversity in biodiversity-rich tropical areas.