Monday, 3 August 2009

Gene angst: finding a DNA barcode for plants

I've been incubating this post since September 2008, so it's kind of cathartic to finally be writing it. I think it will be a good representation of the title and purpose of this blog in the sense that it's a window to some of those things that go on in science - and in the lives of scientists - that don't make it into the peer-reviewed publications.

So why the wait? Quite apart from the fact that it's inappropriate to talk in public about a piece of research before it's published unless all your co-authors agree (and a quick peek at the number of co-authors on this paper will explain why that was a non-starter), this work involved a lot of personalities and politics - even more than the usual paper - and some rather sensitive discussions and debates were being had right up to the publication date.

Speaking of the publication date, you'd be forgiven for thinking this open access PNAS paper came out on Tuesday; there was, after all, a rash of online and print news items1 and press releases2 about the paper that day, even radio and television interviews. But the paper wasn't published in the Early Edition until Thursday. See, PNAS does this weird thing where they lift the press embargoes on all of the papers in each week's issue on Monday night, even though the papers themselves may come out any day that week. I'm not sure why they do this and I find it a little annoying, largely because, though we see a flood of news about a paper on Tuesday, it isn't actually available to non-journalists - you know, like those scientist and taxpayer schmucks - until a few days later. The result is that by the time the paper is out it's too late to influence or even critically filter any of the media surrounding it.

But I digress.

'A DNA barcode for land plants' is the culmination of four years of blood, sweat and tears by a global consortium of researchers called the Plant Working Group (PWG) of the Consortium for the Barcode of Life (CBOL).

The purpose of the PWG is to bring plants up to speed with animals in an international effort to build standardised reference libraries of DNA sequences from known and unknown species. These libraries of 'DNA barcodes' will ultimately enable the rapid identification of unknown specimens (or fragments of specimens) even by non-experts. In the meantime the collaborations and frameworks created to build the libraries will, in the words of John F. Kennedy from his famous "We choose to go to the Moon" speech, "serve to organize and measure the best of our energies and skills."

Because I've blogged about DNA barcoding several times before3, both here and on The Beagle Project Blog, I'm not going to give you a lengthy background on barcoding in this post. Rather, I'll explain briefly why plants needed bringing up to speed in the first place, but then move on quickly to how we did it, and what it was like to be involved.

Why have plants lagged behind animals in terms of amassing DNA barcode reference libraries? It's not that botanists aren't keen to participate. Rather, it's that the gene chosen (and officially endorsed by CBOL and therefore GenBank) to serve as the DNA barcode for animals, CO1, though present in plants, is not variable enough to use in species identification. So the search was on for a CO1 equivalent in plants: a region conserved enough through evolution to be found in and easily amplified from every plant's genome but carrying enough variation to distinguish species.

The approach CBOL took to finding such a region was to assemble a consortium of botanists actively working on DNA barcoding, and to pay for them to have meetings with each other in order to hash it out amongst themselves. As someone working on DNA barcoding plants at the Natural History Museum, I was invited - along with several others - to join in.

This was my first time as a direct participant in science-by-consortium and boy, was it an eye-opener. It turns out trying to get scientists - botanists no less (eek!) - to agree on something is not as easy as one might imagine. (There is a long and inglorious history of botanists disagreeing, but I've already indulged in one digression today...)

The Taipei meeting was widely believed and reported to be something of a mess, with lots of claim-staking but not much progress towards the all-important Final Decision. I vividly remember one moment from the meeting in which we used a white board to list all of the candidate plant barcode regions (and combinations of regions). I photographed the white board (right). Looking back at it now, I think this picture speaks a thousand words with regard to the indecision that was left hanging in the air after Taipei.

The Edinburgh meeting, on the other hand, was more focused, with a mandate to have a decision made before everyone went home. Ably chaired by Pete Hollingsworth, head of the Genetics and Conservation section at the Garden, we spent two days (rather than two hours, as in Taipei) focused on the task.

I can't speak for anyone else, but I personally found the Edinburgh meeting to be a whole lot of fun. In essence, we - 15 plant DNA barcoding specialists from around the world - locked ourselves in a small room and agreed not to come out until we had made a decision. Coffee was administered by IV drip and snacks and sandwiches delivered to an adjacent room for when our brains ran out of ATP. Unlike the Taipei meeting, we had lots of data to hand in Edinburgh. Print-outs of spreadsheets and figures flew around the room like so much confetti and got annotated by hand as they were discussed.

Participants of the Plant Working Group meeting in Edinburgh emerged briefly from their self-confinement for a group photo.

I mentioned data. Our group from the Natural History Museum in London contributed amplification success rates and DNA sequences for six regions from 138 flowering-plant specimens. These specimens were collected during our project to repeat Darwin's botanical survey of Great Pucklands Meadow at Down House (pause for one of those 'oh if Darwin only knew about DNA' moments). This might seem like an impressive amount of data but in fact it was a modest contribution; some of the other groups contributed not hundreds but thousands of sequences. All in all the various research groups contributed data from 907 specimens from 550 species representing the major groups of land plants (including 670/445 angiosperm, 81/38 gymnosperm, and 156/67 cryptogam samples/species) for up to seven candidate regions that had been flagged in Taipei. These regions are, in no particular order, the genes rpoC1, rpoB, matK and rbcL and the inter-genic regions psbK-psbI, atpF-atpH and trnH-psbA.

Back to our little room in Edinburgh. In some cases we analyzed this mountain of data right then and there, and in other cases, as when there were gaps in our data set that still needed filling, we agreed to go back home and churn out those data pronto.

One of the more illuminating analyses we did was to compare how well all possible combinations of one, two, three and seven candidate regions performed in terms of discriminating species. We were (or at least I was) surprised to find that while increasing the number of regions used in combination from one to two improved the power of species discrimination, combinations of three or more weren't any better (right, Figure 1C from the paper).
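For readers who like to see the logic spelled out, here is a minimal sketch of that kind of comparison. It is not the analysis from the paper: the specimens and sequences are made-up toy data, the function names are hypothetical, and "discriminated" is simplified to mean that a species' combined sequence is shared with no other species. The toy data are deliberately contrived so that two regions beat one but three add nothing.

```python
from itertools import combinations

REGIONS = ["rbcL", "matK", "trnH-psbA"]

# Hypothetical toy data: specimen id -> (species, {region: sequence}).
# Species A and B are identical at every region, so no combination separates them.
SPECIMENS = {
    "s1": ("Species A", {"rbcL": "AAAA", "matK": "CCCC", "trnH-psbA": "GGGG"}),
    "s2": ("Species B", {"rbcL": "AAAA", "matK": "CCCC", "trnH-psbA": "GGGG"}),
    "s3": ("Species C", {"rbcL": "AAAT", "matK": "CCCC", "trnH-psbA": "GGGA"}),
    "s4": ("Species D", {"rbcL": "AAAT", "matK": "CCCT", "trnH-psbA": "GGGG"}),
}


def discriminated_fraction(specimens, regions):
    """Fraction of species whose combined sequence is shared with no other species."""
    species_by_barcode = {}
    for species, seqs in specimens.values():
        barcode = tuple(seqs[region] for region in regions)
        species_by_barcode.setdefault(barcode, set()).add(species)

    all_species = {species for species, _ in specimens.values()}
    ambiguous = set()
    for group in species_by_barcode.values():
        if len(group) > 1:  # two or more species share this combined barcode
            ambiguous |= group
    return (len(all_species) - len(ambiguous)) / len(all_species)


def compare_combinations(specimens, regions):
    """Score every combination of one, two, ... up to all of the regions."""
    results = {}
    for k in range(1, len(regions) + 1):
        for combo in combinations(regions, k):
            results[combo] = discriminated_fraction(specimens, combo)
    return results


results = compare_combinations(SPECIMENS, REGIONS)
for combo, score in results.items():
    print(" + ".join(combo), score)
```

With these toy data the best single region separates a quarter of the species, every pair separates half, and all three together still separate only half. A real analysis would work from aligned sequences and a distance-based criterion rather than exact identity.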

In addition to discriminatory power, we also looked at practical issues like universality (i.e., the rate at which we were able to successfully amplify any given region from our collection of specimens) and sequence quality (e.g., the frequency of high-quality sequences obtained for each region, the amount of manual editing required and the concordance of bidirectional sequence reads).
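Two of these practical measures are simple enough to sketch in a few lines. This is an illustration only: the function names are hypothetical, and the concordance measure assumes the forward and reverse reads have already been trimmed to the same stretch of sequence, which real reads rarely are without an alignment step.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def amplification_success_rate(outcomes):
    """Universality: share of specimens from which a region amplified.

    outcomes is a list of booleans, True where amplification succeeded.
    """
    if not outcomes:
        raise ValueError("no amplification attempts recorded")
    return sum(outcomes) / len(outcomes)


def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_concordance(forward, reverse):
    """Share of positions where the forward read agrees with the
    reverse-complemented reverse read (assumes equal, pre-trimmed lengths)."""
    flipped = reverse_complement(reverse)
    if not forward or len(forward) != len(flipped):
        raise ValueError("reads must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(forward, flipped))
    return matches / len(forward)


# Toy usage: 3 successes out of 4 attempts, and one disagreeing base in four.
print(amplification_success_rate([True, True, False, True]))
print(read_concordance("AACG", "CGTA"))
```

In this toy usage both calls give 0.75: three of four amplifications worked, and three of four positions agree between the two reads.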

Ultimately, after all of these analyses, there was no obvious winner, no gleaming silver bullet. And so began the war of attrition, during which we said our tearful goodbyes to certain regions that were okay in terms of universality and sequence quality, but pretty useless for species discrimination (as was the case for two regions, rpoC1 and rpoB), or good at species discrimination but with poor amplification success rates and sequence quality (as was the case for psbK-psbI).

After this weed-out process, we were left with three regions - two genes, matK and rbcL, and one intergenic spacer region, trnH-psbA. Though these three outperformed the rest, none of them alone performed ideally for all three criteria.

At this stage there was an intense discussion about whether we should recommend all three as a combinatorial plant DNA barcode to CBOL, or just two of the three. Some in the group preferred the better-safe-than-sorry approach of a three-region barcode that could be pruned down to two at a later date if one of the three proved superfluous. The majority, however, thought a two-region barcode preferable, both because it would be less expensive in terms of sequencing costs and because it was felt that we needed to be decisive; many would-be plant barcoding projects were being denied funding as a result of funding agencies' fears that their money might be wasted if CBOL shifted the goalposts. Moreover, as I said above, though two regions are better than one at discriminating species, three are not better than two.

So of the three remaining regions, we tasked ourselves to decide which two in combination to recommend to CBOL as 'the' plant DNA barcode. It made sense to choose two regions which would complement each other: one with high universality and sequence quality and good, but not great, discriminatory power (rbcL), the other with better discriminatory power but needing further technical work to improve universality (matK) or sequence quality (trnH-psbA). In the end, the group felt it was easier to overcome the universality difficulties posed by matK than the sequence quality difficulties posed by trnH-psbA.

And there we have it: the Plant Working Group recommends that CBOL adopt4 the combination of rbcL and matK as the official plant DNA barcode.

So that's the story of the scientific process that the Plant Working Group went through to select a DNA barcode for plants, but before I end I want to say a little bit more about the political and social process. If you read between the lines of my account here, you can probably guess that there were some intense disagreements between various members of the working group over how many, and which, regions to select. This raises the question, why would anyone care? It's supposed to be cold, hard, evidence-based science, right?

As PWG member Damon Little carefully said in his WNYC radio interview, '...when this started, a lot of people...[had] their favorite region for various reasons,...because they were the ones that discovered it or...because it was a region that had worked well for them in the past...' In other words, different research groups involved had to some extent pinned their reputations on certain candidate regions. As a result, they advocated those regions for a combination of political and historical reasons as well as scientific reasons.

But it wasn't all sorrow and strife. As you can imagine, after the workshop was over, there was a sense of relief and accomplishment - and for some, lingering frustration - and how better to mark the occasion than by refreshing ourselves at the Scotch Malt Whisky Society Vaults in Leith (right)?

And now we have finally come to my last bit of data in this blog post ...consider it supplementary data to Science Creative Quarterly's 'manuscript' entitled 'Scientists will geek out under any circumstances': at the Whisky Society, we were treated to PWG chairman Pete Hollingsworth's expert tutelage in whisky tasting. Here are some of the various drams we tried:

Whisky tasting with the Plant Working Group. Crop at right shows drams labeled by distillery (they don't tell you which distillery they're from, so these are actually Pete's guesses).

As is only natural, our conversation turned to DNA barcoding, and we noticed that, just as whiskies have their own personalities, so do the plant barcode candidate regions. Moreover, we figured these personalities could be mapped onto one another...

11 comments:

Very interesting post. I can see the huge benefits of agreeing standards like these, but is there a danger that they will become the *definition* of what constitutes a species, irrespective of what the lumpers and splitters might argue? If so, it's to be hoped that you chose wisely!

...is there a danger that they will become the *definition* of what constitutes a species...?

There are several answers to this, and which answer you get depends on who's doing the answering. Here's a sample of how different researchers might respond:

1. Maybe, but that danger has been around ever since people started using DNA sequence data in taxonomy and systematics. Using DNA data alone to define species will never stop being wrong no matter how DNA barcoding unfolds.

2. No, because the rules governing how one proposes and names a new species haven't changed.

3. Yes, but it's not a 'danger' - it's an 'opportunity'! As the microbiologists will tell you, this is a very practical way to go about taxonomy.

4. Yes, but only as a temporary designation - a placeholder - until someone comes along and does a proper taxonomic treatment.

@Graham - Using the current barcodes, the answer is a definite 'no' - both rbcL and matK are in the chloroplast genome, which is maternally inherited, so hybrids will always be mistaken for the maternal species. However, several groups are working on developing complementary identification systems using nuclear genes that could be used to ID hybrids.

Hi! I'm really interested in Plant DNA Barcoding. I just want to ask: if I want to check the ability of the recommended barcodes to discriminate between species of a plant, how many species do I need, and how many individuals per species will be necessary to make the results acceptable? Thanks so much!

Hi Faye, your question is, in essence, the question we're all going to be working on from now forward. But we know enough now to say that you can expect the answer will vary depending on what kind of plants you're working with, whether hybridisation is common, and how geographically constrained the area is that you're working in.

About Data Not Shown

After about a year of blogging over at The Beagle Project Blog, I realised there were certain things I wanted to write about that might be considered a leeeetle too opinionated or off-topic to warrant posting there. See, The Beagle Project is a team effort, and I had the growing worry that with certain posts I might be infecting the others (and the project as a whole) with Karen cooties. So, while I still consider The Beagle Project Blog my primary blogspot, the views expressed over here at Data Not Shown are mine ...my own ...my precious. You can learn more about Data Not Shown, including an explanation of the title, in my inaugural post.