This article has a correction. Please see:

ABSTRACT

Although many novel members of the Coronaviridae have recently been recognized in different species, the ecology of coronaviruses has not been established. Our study indicates that bats harbor a much wider diversity of coronaviruses than any other animal species. Dating of different coronavirus lineages suggests that bat coronaviruses are older than those recognized in other animals and that the human severe acute respiratory syndrome (SARS) coronavirus was directly derived from viruses from wild animals in wet markets of southern China. Furthermore, the most closely related bat and SARS coronaviruses diverged in 1986, an estimated divergence time of 17 years prior to the outbreak, suggesting that there may have been transmission via an unknown intermediate host. Analysis of lineage-specific selection pressure also indicated that only SARS coronaviruses in civets and humans were under significant positive selection, also demonstrating a recent interspecies transmission. Analysis of population dynamics revealed that coronavirus populations in bats have constant population growth, while viruses from all other hosts show epidemic-like increases in population. These results indicate that diverse coronaviruses are endemic in different bat species, with repeated introductions to other animals and occasional establishment in other species. Our findings suggest that bats are likely the natural hosts for all presently known coronavirus lineages and that all coronaviruses recognized in other species were derived from viruses residing in bats. Further surveillance of bat and other animal populations is needed to fully describe the ecology and evolution of this virus family.

The majority of human emerging infectious diseases over the past few decades, including AIDS, Ebola fever, avian influenza, and severe acute respiratory syndrome (SARS), have resulted from the interspecies transmission of zoonotic RNA viruses (5, 16, 21, 33). The identification of closely related viruses in animal hosts is a prerequisite for establishing the ecology of viral emergence along with reconstruction of evolutionary pathways; however, complex ecosystems make this difficult (41). This information is critical for the control of these diseases at their sources and for intervention in possible outbreaks.

Coronaviruses have the largest RNA viral genomes, ranging from 26 to 32 kilobases in length (20). They have been known to cause upper and lower respiratory diseases, gastroenteritis, and central nervous system infection in a number of avian and mammalian hosts, including humans (43). Coronaviruses are placed within the family Coronaviridae, a member of the order Nidovirales, and are classified into three groups based on genetic and antigenic relationships (9, 20).

Uncertainty exists in the classification of SARS coronavirus (SARS-CoV). Initially, SARS-CoV was proposed as a novel group 4 coronavirus (24). Comparative genetic analyses of individual gene segments showed that SARS-CoV was consistently a sister to group 2 CoVs, suggesting that SARS-CoV is an early divergence from other established group 2 viruses. Therefore, it was suggested to be a group 2b CoV (30), and in the current International Committee on Taxonomy of Viruses classification system, SARS-CoV is classified as a group 2 coronavirus. However, phylogenetic and pairwise comparisons of SARS-CoV with other coronaviruses, including those from bats (BtCoVs), showed that they share low homology to both group 2 and group 3 coronaviruses and therefore may belong to a new group (putative group 4) (35). As such, we have classified SARS-CoV in group 4 in this study. The recognition of SARS-like coronaviruses (SARS-like CoVs) in bats suggested that bats might be the most likely natural reservoirs of SARS-CoV (22). Comparative genetic analyses of coronaviruses isolated from bats in China also showed that bat species harbor highly diverse coronaviruses, including a novel lineage (putative group 5) that is exclusive to bats (35). This information suggests that the diversity of coronaviruses is greater than previously recognized and that it is necessary to explore the role of bats in the ecosystem of coronaviruses (35, 45).

Since the onset of the SARS epidemic in early 2003, intense scientific effort has focused on identifying the zoonotic source of SARS-CoV and its transmission route to humans (11, 31, 38, 46). Surveys of domestic and wild animals in southern China revealed that civet cats, raccoon dogs, and ferret badgers from wet markets were vectors for SARS-CoV outbreaks in 2002 and 2003 (11, 31). This was also supported by serological studies that indicated that wildlife traders, animal slaughterers, and restaurant workers with occupational exposure to wild animals, particularly palm civets (Paguma larvata), had a higher SARS-CoV antibody prevalence than did market and community controls (11, 46). However, farm populations of palm civets were negative for SARS-CoV, while market populations tested positive, suggesting that market animals were most probably intermediate hosts for SARS-CoV and were possibly infected from another source in the markets (11, 31, 38). While SARS-like CoVs have been identified from four bat species (genus Rhinolophus) in both northern and southern China, phylogenetic relationships and low genetic similarities to human and civet SARS-CoVs indicate that the natural reservoir of these viruses has still not been determined.

In this study, we estimated and compared the divergence times and population dynamics of all coronavirus groups, including new data from wild animals, to better understand the ecology and evolutionary pathways of these viruses. Our results show that BtCoVs are older than coronaviruses from any other animals, while analysis of population dynamics revealed that coronavirus populations in bats have constant population growth and that viruses from all other hosts show epidemic-like increases in population. These results indicate that diverse coronaviruses are endemic in different bat species, with repeated introductions to other animals and occasional establishment in other species. In addition, divergence dates for SARS-CoV and SARS-like BtCoVs indicated that the SARS-CoV precursor may have circulated in an unidentified host before the 2003 epidemic. Lineage-specific positive selection analysis also suggests that the SARS-CoV precursor has not been identified. Our findings suggest that bats are the natural hosts for all coronavirus lineages and that all coronaviruses recognized in other species were derived from viruses residing in bats.

MATERIALS AND METHODS

Sampling and virus detection.During 2003, wild animals were sampled at the live animal markets in Shenzhen and Guangzhou, Guangdong Province, China. Rectal swabs were taken, placed in transport medium, kept in liquid nitrogen for transportation to the laboratory, and then stored at −80°C. Viral RNA extraction, PCR, and sequencing were conducted as previously described (35).

Phylogenetic analysis.For the reconstruction of evolutionary pathways of coronaviruses isolated from bats, coronaviruses representing the three traditional viral groups, SARS-CoVs from humans and civets (group 4), and coronaviruses from bats (group 5) were included in analyses. Two previously unpublished coronavirus sequences from a Chinese ferret badger and raccoon dog (CFBCoV/DM95/2003 and RDCoV/GM43/2003, respectively) were also included in the analysis. The helicase (HEL) domain of the replicase gene, the spike protein (S) gene, and the nucleocapsid protein (N) gene were chosen for analysis based on function and the availability of BtCoV sequences. Nucleotide sequence alignments were carried out using TransAlign (4) with ClustalX (37) to conserve codon positions and were manually optimized wherever necessary using Se-Al, version 2.0 (http://evolve.zoo.ox.ac.uk/
). All ambiguously aligned regions were removed prior to phylogenetic analysis. Phylogenetic trees were built using neighbor joining in PAUP*, version 4.0b (34), with the appropriate model of sequence evolution selected by MrModeltest, version 2.2 (J. A. A. Nylander, Evolutionary Biology Center, University of Uppsala, Uppsala, Sweden). Gaps were treated as missing data in all analyses. The same model of evolution was used to calculate Bayesian posterior probability in MrBayes, version 3.1 (15).

Estimation of divergence dates and population dynamics.The divergence dates for coronavirus lineages were estimated based on an alignment of complete HEL domain sequences, using the uncorrelated relaxed clock model in BEAST, version 1.4 (http://evolve.zoo.ox.ac.uk/beast/
). Under this model, the rates were allowed to vary at each branch drawn independently from an exponential distribution (8). Sampling dates for each isolate were used as calibration points, and constant population coalescent priors were assumed for all data sets. Depending on the data set, Markov chain Monte Carlo (MCMC) samples chains were run for 50 to 150 million generations, sampling every 1,000 generations under the HKY nucleotide substitution model and allowing γ-rate heterogeneity for all data sets. The convergence of MCMC chains was confirmed for each data set by using the program Tracer, version 1.3 (http://evolve.zoo.ox.ac.uk/tracer/
). The times of divergence were estimated utilizing priors obtained from earlier runs, with a discarded burn-in of approximately 10%. The highest posterior density (HPD) was calculated for each of these parameters.

Coronavirus population dynamics were evaluated for eight independent data sets, using the constant, exponential, and logistic population models in BEAST, version 1.4. The posterior likelihoods of the three models were compared using the Akaike information criteria, with one parameter distinguishing the constant and exponential models, while two parameters distinguished the constant and logistic models (28). MCMC chains were run as previously described, and substitution rates and growth rates with HPDs were calculated.

Detection of recombination and lineage-specific selection.Since the presence of recombination among sequences can obscure the detection of positive selection (18), we first tested for recombination in our data set (see below) by identifying breakpoints, using a genetic algorithm for recombination detection (GARD) (17). Identified breakpoints were then assessed by testing the congruence of neighbor-joining trees generated by GARD for each fragment, using the Shimodaira-Hasegawa test in PAUP* (34). Analysis of lineage-specific selection pressure in the S genes of SARS-CoVs was conducted using the Genetic Algorithm (GA) in the program package HYPHY (19). We analyzed an alignment of the complete S genes (3,783 nucleotides and 1,261 codons) of six BtCoVs and three civet and three human SARS-CoVs and specified a neighbor-joining input tree (GTR+I model) generated in PAUP* as described above. The GA in HYPHY assigns four classes of nonsynonymous and synonymous nucleotide substitution rate ratios (ω = dN/dS) to each lineage in search of the model of lineage-specific evolution that best fits the data (18). The Akaike information criteria were used to select the best-fitting model of lineage-specific evolution. The probability (≥95%) of ω being >1 along a specific lineage was calculated from the averaged model probability of all models rather than by inference from the single best-fitting model (18). This approach does not require any a priori hypothesis of lineage-specific evolution.

Nucleotide sequence accession numbers.The sequences reported in this paper have been deposited in GenBank under accession numbers EF192155
to EF192160
and EF434376
to EF434381
.

RESULTS

Phylogenetic relationships of coronaviruses.Evolutionary relationships of BtCoVs were reconstructed by neighbor-joining and Bayesian analyses of BtCoVs and representatives of all other coronavirus groups. We analyzed the complete HEL domain, generally known by the name nsp 13, of the multidomain replicase gene, the structural spike protein (S) gene, and the nucleocapsid protein (N) gene. Due to the high divergence in the S gene, 1,579 ambiguously aligned characters were removed, resulting in a final alignment size of 2,204 nucleotides. The final sizes of the HEL domain and N gene alignments were 1,797 and 1,812 nucleotides, respectively. The multiple sequence alignments for the HEL domain, S gene, and N gene that were used for analyses are available in Treebase (http://www.treebase.org/treebase/
).

Analyses of all three genes showed that all five coronavirus groups are monophyletic, with high statistical support, and that BtCoVs are found only in groups 1, 4, and 5 (Fig. 1). However, support was lesser for some of the deep nodes, particularly in the S gene tree. Gene phylogenies also showed that interrelationships between groups were not consistent between different genes. In the HEL domain and N gene trees, group 4 SARS-CoVs and SARS-like BtCoVs form a sister clade to group 5 BtCoVs, while group 2 coronaviruses are, in turn, the sister clade to both groups 4 and 5 (Fig. 1A and C). However, these relationships change in the S tree, whereby groups 2 and 5 cluster together and group 4 coronaviruses are the sister clade to both groups (Fig. 1B).

Phylogenetic relationships of helicase (A), spike (B), and nucleocapsid (C) genes based on alignments of 1,797, 2,204, and 1,812 nucleotides, respectively. Breda virus (AY427798) was used to root the trees. Numbers at branch nodes indicate neighbor-joining bootstrap values of ≥50%. Nodes with Bayesian posterior probabilities of ≤95% are indicated with asterisks. Bar, 0.1 nucleotide substitution per site (A) or 100 nucleotide substitutions per site (B and C).

The phylogeny of the HEL domain, for which the most sequence data from bats are available, showed that six BtCoVs (BtCoV/Rp3, BtCoV/HKU3, BtCoV/279/05, BtCoV/273/05, BtCoV/Rm1, and BtCoV/Rf1) belong to group 4 and form a sister clade to SARS-CoVs isolated from humans and civet cats between 2003 and 2004 (Fig. 1A). Therefore, analysis of available sequence data suggests that bats are the predominant carriers of SARS and SARS-like CoVs. A further five BtCoVs isolated from southern China form group 5 and are only distantly related to other coronavirus groups, as reported earlier (35).

In conjunction with previous studies, phylogenetic analysis of the HEL domain and the N gene showed that group 1 coronaviruses form two separate subgroups (Fig. 1A and C) (10, 11). The first subgroup contains all group 1 BtCoVs, all common cold coronaviruses (HCoVs 229E and NL63) from humans, and porcine epidemic diarrhea virus (PEDV), indicating that bats may be the natural reservoirs for PEDV and common cold coronaviruses. In the HEL domain tree, four BtCoVs, represented by BtCoV/512/05, are most closely related to PEDV (Fig. 1A). This relationship is maintained in the N gene, but it must be noted that N gene sequence data for three of the viruses were not available. The second subgroup contains the two wild animal coronaviruses (raccoon dog CoV/GZ43/03 and Chinese ferret badger CoV/DM95/03) along with canine and feline coronaviruses and porcine transmissible gastroenteritis virus (TGEV). However, these subgroups are not apparent in the S gene tree. Three of the BtCoVs (BtCoV/773/05, BtCoV/911/05, and BtCoV/977/05) now fall in a basal position within group 1, while the remaining BtCoVs, represented by BtCoV/512/05, maintain a close relationship to PEDV (Fig. 1B). Taken together, the above results indicate that BtCoVs are precursors to previously established coronavirus lineages.

The S gene data set displayed the greatest level of divergence between coronavirus groups and may be approaching sequence saturation. Saturation occurs due to multiple nucleotide substitutions at a given nucleotide position over time, masking the true level of divergence and obscuring deeper phylogenetic relationships to the point of making them unrecoverable (1, 2). Since the HEL domain is conserved across all groups, it will most likely reflect the true coronavirus phylogeny, and while the N gene is less conserved than the HEL domain, the regions used in the analysis were also conserved between groups. As such, this may explain the inconsistency between phylogenetic relationships for the S gene and those for the other genes tested, even following removal of highly variable regions and application of an evolutionary model. Therefore, the HEL domain and, to a lesser extent, the N gene are considered the most suitable for calculation of evolutionary rates and divergence times for coronaviruses.

Estimation of divergence dates.The most recent common ancestor (MRCA) dates for major coronavirus lineages (Table 1) were calculated for the HEL domain, using the uncorrelated relaxed clock model. However, the large HPDs associated with most of these dates necessitate caution in interpreting these results. The MRCA date for coronaviruses was estimated to be 1586 (HPDs, 1102 to 1878), approximately 500 years before the present, while the major divergence between group 1 and groups 2, 4, and 5 was estimated to occur in 1647 (HPDs, 1247 to 1882) (Fig. 2 and Table 2). Among coronavirus groups, the earliest MRCA date estimates were for group 1 coronaviruses, at 1800 (HPDs, 1570 to 1927), and group 2 coronaviruses, at 1892 (HPDs, 1817 to 1945) (Fig. 2 and Table 2). The group 3 avian IBV coronavirus MRCA date was estimated to be 1925 (HPDs, 1887 to 1941), and those for groups 4 and 5 were 1961 (HPDs, 1918 to 1995) and 1955 (HPDs, 1902 to 1991), respectively (Fig. 2 and Table 2). These results suggest that group 1, which includes a high genetic diversity of BtCoVs, may have evolved much earlier than other coronavirus lineages. Within group 4, the estimated time of divergence for SARS-CoVs from civet cats and humans and the closest SARS-like BtCoV isolate (BtCoV/Rp3) was 1986 (HPDs, 1964 to 2001), a mean divergence time of 17 years before the 2003 outbreak. Furthermore, the MRCA of SARS-CoVs from humans and civet cats was estimated to exist in 1999 (HPDs, 1990 to 2003). These results indicate that the precursor of the virus responsible for the SARS outbreak may not have been determined.

Divergence dates of coronavirus lineages represented on a phylogenetic tree generated based on 1,797 nucleotides of the complete HEL domain sequence. Breda virus (AY427798) was used to root the tree. Numbers at branch nodes indicate the times of divergence, calculated using the uncorrelated exponential relaxed-clock model in BEAST v1.4. For an explanation of the letters at branch nodes, see Table 2.

Detection of recombination and lineage-specific selection.We investigated the positive selection pressure along the S gene in an attempt to determine the evolutionary stability of group 4 coronaviruses in different hosts. However, because the presence of recombination among sequences can obscure detection of positive selection, we first tested for recombination in our data set. Five recombination breakpoints were detected in the S gene data set for SARS-CoVs by using GARD (Fig. 3A) (17). However, the Shimodaira-Hasegawa test (29) showed that there was no significant evidence (P > 0.05) of incongruence between phylogenies of the six recombinant fragments, and no interhost recombination was observed in the phylogenies (data not shown). Therefore, recombination among these sequences was considered not significant enough to affect the detection of positive selection along lineages.

Detection of recombination (A) and lineage-specific selection pressure (B) on the spike genes of SARS-CoVs. Mean ω values calculated using GA are presented above branches, with ω values of >1 shown in bold. Numbers below branches indicate an averaged model probability of >1 along specific lineages. Branches with P values of ≥95% are highlighted in gray. The Akaike information criterion (c-AIC) for the best-fitting model was 23,557.

The averaged model probability of branches showed that only two branches were under significant positive selection (≥95% confidence interval), with one leading to civet SARS-CoVs (ω = 2.149) and one leading to the three human isolates (ω = 2.144) (Fig. 3B). Other branches with ω values of >1 were not under significant positive selection pressure. These findings further support the hypothesis that SARS-CoVs were recently introduced into civet and human populations from an unknown host.

Population dynamics of coronaviruses.Group 1 coronaviruses contain two distinct sublineages, A and B, which were treated independently in these analyses (Fig. 1A and C). Sublineage A (group 1A) consists of TGEV, feline coronavirus, and wild animal coronaviruses, while sublineage B (group 1B) consists of all group 1 BtCoVs, PEDV, and human common cold coronaviruses. The best-fitting population models for groups 1A and 1B were exponential growth and a constant population size, respectively, suggesting the recent introduction of coronaviruses into domestic and wild animals from a population endemic in bats (Table 3). Even when the group 1B bat viruses were analyzed separately (group 1B bats), the best-fitting population growth model remained a constant population size (Table 3). This likely reflects the early divergence and establishment of PEDV and human cold coronaviruses from BtCoV populations.

Two sublineages, A and B, within group 4 were also independently analyzed. Sublineage A (group 4A) contains only BtCoVs, while sublineage B (group 4B) includes civet and human SARS-CoVs along with BtCoV/Rp3 (Fig. 2). Groups 4A and 5, which consist only of BtCoVs, had a best-fitting model of constant population size, reflecting the endemicity of coronaviruses in bats despite multiple interspecies transmission events between different bat species (Table 3). The best-fitting model for group 4B, which contains BtCoV/Rp3 and SARS-CoVs, was for exponential population growth, with an extraordinarily high mean growth rate compared to those of all other groups (Table 3). Taken together, these results indicate that coronaviruses are endemic in bat populations and that after interspecies transmission to a naïve host, there was a change to epidemic-like population growth, particularly for group 4B, which contains civet and human SARS-CoVs. However, since coalescent estimates improve with increased sample size, especially for complex demographic models, it will be important to reexamine these results as more CoV data become available.

DISCUSSION

The findings of our study show that coronaviruses from bats have the most genetic diversity and are older than all coronaviruses recognized from any other animal species. We also show that coronaviruses in bats have a stable population size, while in other animals there is epidemic-like population growth. All of these attributes consolidate the concept that bats are the natural reservoir of all currently known coronavirus lineages.

Natural reservoirs for RNA viruses typically harbor the most genetic diversity in viruses among all recognized hosts while showing no symptoms of infection (13). A change from a constant virus population size to logistic or exponential population growth has been associated with the interspecies transmission of viruses from their natural reservoir to an alternate host (28). Previously, limited genetic information made it almost impossible to establish the ecology of coronaviruses. However, the accumulated information presented here allows us to better understand the ecology and evolutionary pathways of coronaviruses (Fig. 4).

Although the ancestor of all of the established coronavirus lineages has not been identified, our data indicate that the hypothetical ancestor is likely from a bat. For group 1, there have been two interspecies transmissions from bats to other animals, including humans, and most interestingly between wild animals and domestic pets. However, no precursor from bats has been identified for these group 1A viruses (Fig. 2 and 4). A similar situation exists for group 2 viruses, with interspecies transmissions between domesticated animals, mice, and humans but with no known precursor virus from bats. This is likely because bat species have only recently been studied, and it is possible that after extensive surveillance the evolutionary history of group 2 coronaviruses can be more clearly understood.

The situation with groups 4 and 5 is better resolved, with a clear evolutionary pathway from BtCoVs to SARS-CoV, albeit through an unknown intermediate host, while group 5 viruses appear to circulate among different species of bat (35). Two distantly related coronavirus lineages (groups 1A and 4A) have been identified from wild animals, with a possible interspecies transmission to domestic animals for group 1A and a jump to humans for group 4B. The exponential population growth in both these groups shows that coronaviruses are not endemic in wild animals and that these animals have acquired the viruses recently. This highlights the importance of further surveillance in a variety of mammals, especially given the emergence of new host populations in China that has resulted from increased consumption of exotic animals in the last 20 years.

Representing more of an enigma are group 3 coronaviruses, which are found exclusively in domestic poultry and are highly divergent from other groups. Given the recent divergence date and exponential population growth, we hypothesize that this group arose from a single introduction from an ancestral BtCoV. A possible transmission pathway from bats to poultry may have occurred via raptors, which are known to prey on bats (3). Obviously, this is speculative and may only be confirmed by virological investigation of bird populations.

The emergence and evolutionary pathway of SARS-CoVs provide a unique opportunity to study the evolutionary dynamics involved with the interspecies transmission of coronaviruses. Comparison of population growth models of group 4A and 4B viruses indicated a shift from a constant population size to epidemic-like growth following interspecies transmission from the viruses' natural reservoir to alternate hosts. An understanding of the ecology and evolution of coronaviruses will provide crucial information to identify viruses with epidemic potential. It is likely that once more information becomes available, ideas regarding coronaviruses will continue to change radically, as they have since 2003.

ACKNOWLEDGMENTS

This work was supported by the Research Fund for the Control of Infectious Diseases of the Health Welfare and Food Bureau of the Hong Kong SAR Government (no. 03040782), by the National Special Task Force Fund for Identification of the Animal Reservoir of SARS-CoV, and by the Li Ka Shing Foundation.

We thank Andrew Rambaut and Alexei Drummond for their help with the BEAST software package.