
Phylogenetic analysis of Guinea 2014 EBOV ebolavirus outbreak

Gytis Dudas (@evogytis) & Andrew Rambaut (@arambaut)

A recent article in The New England Journal of Medicine (Baize et al., 2014) suggests that the ongoing outbreak in Guinea is caused by a divergent variant of the Zaire ebolavirus (EBOV) lineage. EBOV has previously caused Ebola outbreaks in the Democratic Republic of Congo and Gabon. The authors publish three complete genome sequences from the Guinea outbreak and perform a phylogenetic analysis using 24 other Zaire lineage sequences and some representatives of the other lineages. One finding of this analysis is that the 2014 sequences fall as a divergent lineage outside the Zaire lineage, suggesting that this may be a pre-existing endemic virus in West Africa rather than the result of spread of the EBOV lineage from the Central African countries that have had previous human outbreaks.

All complete genome sequences from the genus Ebolavirus (which includes the Bundibugyo (BDBV), Reston (RESTV), Sudan (SUDV), Tai Forest (TAFV) and Zaire (EBOV) ebolavirus species) were collated from GenBank, including the sequences from the Guinea outbreak. A list of the sequences and the publications they are from is available here. Thanks to Stephan Günther and his co-authors for sending the sequences from the NEJM paper.

A simple alignment of the complete genomes and a maximum likelihood tree confirm the phylogenetic position shown in the NEJM paper:

We then extracted the coding sequences of each gene (the ebolavirus genome contains seven protein-coding genes separated by intergenic regions with several functions). Removing these intergenic regions appears to result in the Guinea outbreak sequences falling quite firmly within the diversity of Zaire ebolavirus. We suspect that the discrepancy arose through long branch attraction. Here is a MrBayes Bayesian phylogeny of the coding regions:

Expanding the EBOV region of the tree (this is the same tree but with the divergent ebolavirus species cropped out) we see the position of the Guinea outbreak nested within the EBOV clade.
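The split of each genome into coding and intergenic partitions can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for the trees here: the function name `split_regions`, the toy sequence and the coordinates are all hypothetical, and the regions are assumed to be non-overlapping and given as 1-based inclusive positions in the alignment.

```python
def split_regions(sequence, coding_regions):
    """Split one aligned sequence into coding and intergenic parts.

    `coding_regions` is a list of (start, end) pairs, 1-based and inclusive,
    assumed not to overlap. Returns (coding, intergenic) as two strings,
    each the concatenation of its pieces in genome order.
    """
    coding_parts = []
    intergenic_parts = []
    cursor = 0  # 0-based index of the first position not yet assigned
    for start, end in sorted(coding_regions):
        intergenic_parts.append(sequence[cursor:start - 1])
        coding_parts.append(sequence[start - 1:end])
        cursor = end
    intergenic_parts.append(sequence[cursor:])  # trailing non-coding region
    return "".join(coding_parts), "".join(intergenic_parts)


# Toy example (hypothetical data): lowercase letters mark intergenic sites.
toy_genome = "ggATGAAATAAccATGCCCTGAtt"
toy_coding_regions = [(3, 11), (14, 22)]
toy_coding, toy_intergenic = split_regions(toy_genome, toy_coding_regions)
print(toy_coding)
print(toy_intergenic)
```

Applying the same coordinates to every sequence in an alignment yields one coding-only and one intergenic-only alignment, which can then be analysed separately or as two partitions with different rates.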

If we analyse only the intergenic regions we see a similar picture with the Guinea sequences nested within EBOV:

Note that these two trees are essentially identical but differ in where the other ebolavirus species root the EBOV clade (on the 2007 Gabon outbreak for the coding regions and on the 1995 Kikwit outbreak for the intergenic regions). This shows that rooting this clade using the very divergent other ebolavirus species is highly problematic.

However, EBOV is estimated to evolve at about 7×10⁻⁴ substitutions per site per year (Carroll et al., 2013), which means that the virus will have accumulated a substantial number of substitutions (roughly 0.03 per site along a single lineage) over the nearly 40 years since the first recorded outbreak in 1976. We can use this to root the EBOV tree and look at where the Guinea outbreak lies. Here we estimated the phylogeny of the coding sequences using MrBayes (a maximum likelihood tree using PhyML gave an almost identical tree). We then used the software Path-O-Gen to find the root that gave the best association between genetic divergence and time. This rooted the tree close to the earliest sequences (Zaire 1976 outbreak) with the following relationship:

The resulting tree is as follows:

The Bayesian posterior support for every grouping between the outbreaks is 1.0, including the grouping of Guinea 2014 with DRC 2007 and Gabon 2002. This demonstrates that the uncertainty about the position of the Guinea 2014 lineage in the complete ebolavirus trees was due to the rooting of the EBOV clade (i.e., where the divergent outgroups connect to the EBOV tree). The relationships among the EBOV outbreaks are completely consistent for the simple whole genome alignment, the coding regions only and the intergenic regions only, but the position of the root changes. In the above figure, A) denotes the position of the root for the full genome maximum likelihood tree, B) for the Bayesian coding-sequence only tree, C) for the Bayesian intergenic regions only tree and D) for the combined coding-sequence and intergenic region analysis accommodating different rates of evolution.
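The idea behind rooting on the association between genetic divergence and time can be illustrated with a short sketch. This is not the algorithm of the software used above. It is a simplified version that assumes candidate roots lie only at internal nodes (not along branches) and scores each one by the R² of an ordinary least-squares regression of root-to-tip distance on sampling date. The tree, node names, dates and rate below are toy values invented for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """edges: iterable of (node_a, node_b, branch_length) for an unrooted tree."""
    adj = defaultdict(list)
    for a, b, length in edges:
        adj[a].append((b, length))
        adj[b].append((a, length))
    return adj


def distances_from(adj, root):
    """Path length from `root` to every other node (iterative depth-first walk)."""
    dist = {root: 0.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbour, length in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                stack.append(neighbour)
    return dist


def regress(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


def best_root(edges, tip_dates):
    """Return (node, slope, intercept, r_squared) for the best-scoring internal node."""
    adj = build_adjacency(edges)
    tips = sorted(tip_dates)
    dates = [tip_dates[t] for t in tips]
    best = None
    for candidate in adj:
        if candidate in tip_dates:
            continue  # simplification: only internal nodes are tried as roots
        dist = distances_from(adj, candidate)
        slope, intercept, r2 = regress(dates, [dist[t] for t in tips])
        # A negative slope would mean divergence decreasing through time.
        if slope > 0 and (best is None or r2 > best[3]):
            best = (candidate, slope, intercept, r2)
    return best


# Toy clock-like tree (hypothetical): branch length = 0.001 * elapsed years,
# with node n1 placed at 1970.
toy_edges = [
    ("n1", "A", 0.006), ("n1", "n2", 0.015),
    ("n2", "B", 0.010), ("n2", "n3", 0.010),
    ("n3", "C", 0.007), ("n3", "n4", 0.005),
    ("n4", "D", 0.007), ("n4", "E", 0.014),
]
toy_dates = {"A": 1976, "B": 1995, "C": 2002, "D": 2007, "E": 2014}
toy_result = best_root(toy_edges, toy_dates)
print(toy_result[0])
```

In this sketch the slope of the fitted line is an estimate of the evolutionary rate, and the date at which the line crosses zero divergence (`-intercept / slope`) is a rough estimate of the age of the root.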

From this we can see that it is likely that the viruses causing the 2014 outbreak in Guinea and West Africa have spread from Central Africa at some point since the early 1990s. Without viruses from the animal reservoir sampled over this time frame and across the geographical range, it is probably not possible to say more.