Significance

The entire history of life is the story of virus–host coevolution. Therefore the origins and evolution of viruses are an essential component of this process. A signature feature of the virus state is the capsid, the proteinaceous shell that encases the viral genome. Although homologous capsid proteins are encoded by highly diverse viruses, there are at least 20 unrelated varieties of these proteins. We show here that many, if not all, capsid proteins evolved from ancestral proteins of cellular organisms on multiple, independent occasions. These findings reveal a stronger connection between the virosphere and cellular life forms than previously suspected.

Abstract

Viruses are the most abundant biological entities on earth and show remarkable diversity of genome sequences, replication and expression strategies, and virion structures. Evolutionary genomics of viruses revealed many unexpected connections but the general scenario(s) for the evolution of the virosphere remains a matter of intense debate among proponents of the cellular regression, escaped genes, and primordial virus world hypotheses. A comprehensive sequence and structure analysis of major virion proteins indicates that they evolved on about 20 independent occasions, and in some of these cases likely ancestors are identifiable among the proteins of cellular organisms. Virus genomes typically consist of distinct structural and replication modules that recombine frequently and can have different evolutionary trajectories. The present analysis suggests that, although the replication modules of at least some classes of viruses might descend from primordial selfish genetic elements, bona fide viruses evolved on multiple, independent occasions throughout the course of evolution by the recruitment of diverse host proteins that became major virion components.

Viruses are the most abundant biological entities on our planet and have a profound impact on global ecology and the evolution of the biosphere (1–4), but their provenance remains a subject of debate and speculation. Three major alternative scenarios have been put forward to explain the origin of viruses (5). The virus-first hypothesis, also known as the primordial virus world hypothesis, regards viruses (or virus-like genetic elements) as intermediates between prebiotic chemical systems and cellular life and accordingly posits that virus-like entities originated in the precellular world. The regression hypothesis, in contrast, submits that viruses are degenerated cells that have succumbed to obligate intracellular parasitism and in the process shed many functional systems that are ubiquitous and essential in cellular life forms, in particular the translation apparatus. Finally, the escape hypothesis postulates that viruses evolved independently in different domains of life from cellular genes that embraced selfish replication and became infectious. The three scenarios are not mutually exclusive, because different groups of viruses potentially could have evolved via different routes. Over the years, all three scenarios have been revised and elaborated to different extents. For instance, the diversity of genome replication-expression strategies in viruses, contrasting the uniformity in cellular organisms, had been considered to be most compatible with the possibility that the virus world descends directly from a precellular stage of evolution (4, 6); the discovery of giant viruses infecting protists led to a revival of the regression hypothesis (7–9); and an updated version of the escape hypothesis states that the first viruses have escaped not from contemporary but rather from primordial cells, predating the last universal cellular ancestor (10).
The three evolutionary scenarios imply different timelines for the origin of viruses but offer little insight into how the different components constituting viral genomes might have combined to give rise to modern viruses.

A typical virus genome encompasses two major functional modules, namely, determinants of virion formation and those of genome replication. Understanding the origin of any virus group is possible only if the provenances of both components are elucidated (11). Given that viral replication proteins often have no closely related homologs in known cellular organisms (6, 12), it has been suggested that many of these proteins evolved in the precellular world (4, 6) or in primordial, now extinct, cellular lineages (5, 10, 13). The ability to transfer the genetic information encased within capsids—the protective proteinaceous shells that comprise the cores of virus particles (virions)—is unique to bona fide viruses and distinguishes them from other types of selfish genetic elements such as plasmids and transposons (14). Thus, the origin of the first true viruses is inseparable from the emergence of viral capsids.

Viral capsid proteins (CPs) typically do not have obvious homologs among contemporary cellular proteins (6, 15), raising questions regarding their provenance and the circumstances under which they have evolved. One possibility is that genes encoding CPs have originated de novo within the genomes of nonviral selfish replicons by generic mechanisms, such as overprinting and diversification (16). Alternatively, these proteins could have first performed cellular functions, subsequently being recruited for virion formation. For instance, it has been proposed that virus-like particles could have served as gene-transfer agents in precellular communities of replicators (17). Another possibility is that virus-like particles evolved in the cellular context as micro- and nanocompartments, akin to prokaryotic carboxysomes and encapsulins, for the sequestration of various enzymes for specialized biochemical reactions (18, 19).

Studies on the origin of viral capsids are severely hampered by the high sequence divergence among these proteins. Nevertheless, numerous structural comparisons have uncovered unexpected similarities in the folds of CPs from viruses infecting hosts from different cellular domains, testifying to the antiquity of the CPs and the evolutionary connections between the viruses that encode them (20–23). It also became apparent that the number of structural folds found in viral CPs is rather limited. For instance, viruses with dsDNA genomes from 20 families have been shown to possess CPs with only five distinct structural folds (23). Here, to investigate the extent of the diversity and potential origins of viral capsids and to gain further insight into virus origins and evolution, we performed a comparative analysis of the major structural proteins across the entire classified virosphere and made a focused effort to identify cellular homologs of these viral proteins.

Results and Discussion

A Comprehensive Census of Viral Capsid and Nucleocapsid Proteins.

Viruses display remarkable diversity in the complexity and organization of their virions. With few exceptions, nonenveloped virions are constructed from one major capsid protein (MCP), which determines virion assembly and architecture, and one or a few minor CPs. By contrast, enveloped virions often contain nucleocapsid (NC) proteins which form nucleoprotein complexes with the respective viral genomes, matrix proteins linking the nucleoprotein to the lipid membrane, and envelope proteins responsible for host recognition and membrane fusion. These proteins often constitute a considerable fraction of the virion mass, making it challenging to single out the major virion protein. Nevertheless, NC proteins of some enveloped viruses are homologous to the MCPs of nonenveloped viruses [e.g., nonenveloped tenuiviruses and enveloped phleboviruses (24)] and thus are considered to be functionally equivalent herein.

Analysis of the available sequences and structures of major CP and NC proteins encoded by representative members of 135 virus taxa (117 families and 18 unassigned genera; Table S1) (25, 26) allowed us to attribute structural folds to 76.3% of the known virus families and unassigned genera. The remaining taxa included viruses that do not form viral particles (3%) and viruses for which the fold of the major virion proteins is not known and could not be predicted from the sequence data (20.7%). The former group includes capsidless viruses of the families Endornaviridae, Hypoviridae, Narnaviridae, and Amalgaviridae, all of which appear to have evolved independently from different groups of full-fledged capsid-encoding RNA viruses (27–29). The latter category includes eight taxa of archaeal viruses with unique morphologies and genomes (30), pleomorphic bacterial viruses of the family Plasmaviridae, and 19 diverse taxa of eukaryotic viruses (Table S1). It should be noted that, with the current explosion of metagenomics studies, the number and diversity of newly recognized virus taxa will continue to rise (31). Although many of these viruses are expected to have previously observed CP/NC protein folds, novel architectural solutions doubtlessly will be discovered as well.

The 76.3% of viral taxa for which the fold of the major virion proteins was defined could be divided into 18 architectural classes (Fig. 1), also referred to as “structure-based viral lineages” (20, 23). These architectural classes were unevenly populated by viral taxa: Seven major architectural classes covered 64.4% of the known virosphere. Of the remaining 11 minor classes, seven contained folds unique to a single virus family, three folds were found in two families each, and the fold specific to the NC protein of members of the order Nidovirales was conserved in viruses from three families (Fig. 1).
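The census percentages quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only counts stated in the text (135 taxa, four capsidless families, 8 + 1 + 19 taxa with unknown folds, and the composition of the 11 minor classes); the variable names are ours, and the count of taxa in the seven major classes is derived by subtraction rather than taken from Table S1.

```python
# Recompute the census percentages from the counts given in the text.
TOTAL_TAXA = 117 + 18  # families + unassigned genera = 135

# Endornaviridae, Hypoviridae, Narnaviridae, Amalgaviridae
capsidless = 4
# 8 archaeal taxa + Plasmaviridae + 19 eukaryotic taxa
unknown_fold = 8 + 1 + 19
defined_fold = TOTAL_TAXA - capsidless - unknown_fold

# Minor classes: 7 folds in one family each, 3 folds in two families each,
# and the Nidovirales NC fold in three families.
minor_taxa = 7 * 1 + 3 * 2 + 1 * 3
# Derived by subtraction (an assumption of this sketch, not a Table S1 value).
major_taxa = defined_fold - minor_taxa


def pct(n: int) -> float:
    """Share of all analyzed taxa, in percent, rounded to one decimal."""
    return round(100 * n / TOTAL_TAXA, 1)


for label, n in [
    ("fold defined", defined_fold),
    ("capsidless", capsidless),
    ("fold unknown", unknown_fold),
    ("seven major classes", major_taxa),
    ("eleven minor classes", minor_taxa),
]:
    print(f"{label:>22}: {n:3d} taxa ({pct(n)}%)")
```

Under these counts the three top-level categories account for all 135 taxa, and the major and minor classes together account for every taxon with a defined fold.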

Among viral taxa for which the fold of the major virion proteins could be defined, 73 taxa (71%) have icosahedral virions, and 29 taxa (29%) include viruses with helical (nucleo)capsids. By contrast, among viral taxa with unknown CP/NC protein folds, nearly half (13 taxa) contain viruses with helical nucleoprotein complexes, whereas those with icosahedral capsids belong to only six taxa (21%); the rest of these taxa include viruses with bacilliform, droplet-shaped, pleomorphic, spherical, bottle-shaped and spindle-shaped virions (Table S1). Thus, the CP structures of viruses with icosahedral capsids seem already to have been sampled to considerable depth, whereas viruses with helical (nucleo)capsids are understudied and might be found to have novel structural folds in the future.

Icosahedral capsids of characterized viruses are constructed from CPs with 10 remarkably diverse structural folds, which range from exclusively α-helical to β-strand–based. The inherent ability of so many structurally unrelated proteins to assemble into icosahedral particles refutes the argument that the structural similarity between the MCPs of viruses infecting hosts in different domains of life is a result of convergent evolution, whereby the sheer geometry of the icosahedral shell constrains the evolution of a CP to a particular fold (32). Interestingly, the unrelated folds of the viral proteins that form helical (nucleo)capsids are largely α-helical. The reasons for such a bias are unclear, given that cellular helical filaments, such as certain bacterial pili, can be formed from β-strand–based proteins (33). Here we present a focused attempt to infer the likely evolutionary ancestry of the different classes of major virion proteins including CP, NC, and matrix proteins.

Origins of Viral Structural Proteins.

The origin of viral (nucleo)capsids is one of the key unanswered questions in virus evolution. To understand the provenance of the major proteins constituting viral particles, we performed systematic comparisons of viral proteins with the global database of protein sequences and structures.

Jellyroll fold.

The single jellyroll (SJR) is the most prevalent fold among viral CPs, representing ∼28% of the CPs in the analyzed set of virus taxa (Fig. 1). High-resolution CP structures are available for viruses from 23 of the 38 taxa with SJR CPs (Table S1). Searches against the Protein Data Bank (PDB) database using the DALI server (34) seeded with representative CP structures resulted in multiple matches to cellular proteins containing the SJR-fold domains. Indeed, SJR proteins are widespread in organisms from all three cellular domains and are functionally diverse. However, most of those with the highest similarity to viral CPs can be classified into four major groups (Fig. 2). One of the most common functions of SJR domains in cellular proteins is carbohydrate recognition and binding. Accordingly, SJR domains are often appended to various carbohydrate-active enzymes, such as glycoside hydrolases (35). For example, a search seeded with the CP of satellite panicum mosaic virus (PDB ID code: 1STM) retrieved the carbohydrate-binding module from Ruminococcus flavefaciens (PDB ID code: 4D3L) with a highly significant DALI Z score of 7.9 (Fig. 2) despite the lack of appreciable sequence similarity. Similar results were obtained when searches were initiated with CPs from other virus families. Another family of cellular SJR proteins includes the P domain found in archaeal, bacterial, and eukaryotic subtilisin-like proteases, in which this domain is thought to assist in protein stabilization (36, 37). The P domain of Saccharomyces cerevisiae protease Kex2 (PDB ID code: 1R64) was retrieved with the CP of tobacco streak virus (Bromoviridae) (PDB ID code: 4Y6T) with a Z score of 5.8 (Fig. 2). The third family of viral CP-like SJR proteins includes nucleoplasmins and nucleophosmins (Fig. 2), molecular chaperones that bind to core histones and promote nucleosome assembly in eukaryotes (38). 
The core SJR domain of nucleoplasmins/nucleophosmins forms stable pentameric and decameric complexes that serve as platforms for binding histone octamers (39, 40). Hits to nucleoplasmins/nucleophosmins were obtained with different CPs, including the MCP of dsDNA bacteriophage P23-77 (PDB ID code: 3ZMO) of the family Sphaerolipoviridae (hit to PDB ID code: 2P1B; Z score, 5.2).

Fig. 2. Viral and cellular SJR proteins. (A) A selection of viral SJR CP structures. The rightmost structure corresponds to the virion of STNV. (B) A selection of cellular SJR protein structures. The rightmost structure corresponds to the 60-subunit virion-like assembly of the human sTALL-1 protein. All structures are colored using the rainbow scheme from blue (N terminus) to red (C terminus). The linker region leading to the DNA-binding domain in AraC is shown in gray. (C) Relationships between cellular and viral SJR proteins. The matrix and cluster dendrograms are based on the pairwise Z score comparisons calculated using DALI. For the complete matrix, see Dataset S1. The color scale indicates the corresponding Z scores. RNA viruses are shown in green, ssDNA viruses in blue, and dsDNA viruses in red. All compared structures are indicated with the corresponding PDB identifiers. CBM, carbohydrate-binding module; NP, nucleoplasmin/nucleophosmin; PCV2, porcine circovirus 2; PRO-P, P domain of subtilisin-like proteases; SPMV, satellite panicum mosaic virus; STMV, satellite tobacco mosaic virus; TSV, tobacco streak virus.

The fourth broad group of cellular proteins with CP-like SJR domains comprises cytokines of the TNF superfamily. TNF-like ligands and their corresponding receptors play pivotal roles in mammalian cell host-defense processes, inflammation, apoptosis, autoimmunity, and organogenesis (41). The biologically active form of TNF-like proteins is a trimer. Remarkably, however, soluble tumour necrosis factor- and Apo-L-related leucocyte-expressed ligand-1 (sTALL-1), a member of the TNF superfamily, has been shown to form 60-subunit (20 trimers) virus-like particles (42) that superficially resemble 60-subunit T = 1 virions (12 pentamers) of satellite tobacco necrosis virus (STNV) (Fig. 2 A and B). Furthermore, sTALL-1 is identified as a structural homolog of STNV CP with a DALI Z score of 7.7. Importantly, TNF-like proteins are not exclusive to eukaryotes but also are prevalent in bacteria. Perhaps most notable among these is the Bacillus collagen-like protein of anthracis (BclA), which is found in the outermost surface layer of Bacillus anthracis spores (43). A DALI search with the CP of cowpea mosaic virus (family Secoviridae; PDB ID code: 1NY7) retrieved BclA (PDB ID code: 3AB0) with a Z score of 5.4.
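The subunit counts in this comparison follow from simple arithmetic. The sketch below uses the standard Caspar-Klug relation from quasi-equivalence theory (T = h² + hk + k², with 60T subunits per icosahedral shell); that relation is textbook background rather than something derived in this paper, and the function names are ours. It shows only that 20 trimers and 12 pentamers both give 60 subunits. It does not imply that the sTALL-1 particle is organized like a T = 1 capsid.

```python
def triangulation_number(h: int, k: int) -> int:
    """Caspar-Klug triangulation number T = h^2 + h*k + k^2."""
    if h < 0 or k < 0 or (h == 0 and k == 0):
        raise ValueError("h and k must be non-negative and not both zero")
    return h * h + h * k + k * k


def subunit_count(t: int) -> int:
    """Number of chemically identical subunits in a shell with triangulation number t."""
    return 60 * t


t1 = triangulation_number(1, 0)  # T = 1
t3 = triangulation_number(1, 1)  # T = 3

# A T = 1 virion has 60 subunits, arranged as 12 pentamers.
stnv_subunits = subunit_count(t1)
# The sTALL-1 particle has the same subunit count, built from 20 trimers.
stall1_subunits = 20 * 3

print(stnv_subunits, 12 * 5, stall1_subunits)
print(subunit_count(t3))
```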

More divergent SJR domains are found in functionally diverse proteins of the Cupin superfamily (44). Although the conserved structural core in this superfamily consists of six β-strands, some members, such as oxygenases and Jumonji C (JmjC) domain-containing histone demethylases, contain eight antiparallel β-strands (45, 46). The Cupin superfamily includes bacterial transcription factors related to the arabinose operon regulator, AraC, in which the N-terminal SJR Cupin domain (Fig. 2B) is responsible for arabinose binding and dimerization and is fused to the C-terminal helix-turn-helix (HTH) DNA-binding domain (44). Searches seeded with AraC from Escherichia coli (PDB ID code: 2ARC) resulted in a match to the CP of San Miguel sea lion virus (family Caliciviridae) (PDB ID code: 2GH8) with a Z score of 2.7.
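For orientation, the DALI hits quoted so far can be gathered in one place and ranked. The sketch below records only the values given in the text; `None` marks a PDB code that the text does not state. The cutoff of 2.0 is an assumption based on the usual rule of thumb for DALI Z scores, not a threshold stated in this paper.

```python
# DALI hits quoted in the text: (query, query PDB, hit, hit PDB, Z score).
# None marks an identifier that the text does not give.
REPORTED_HITS = [
    ("satellite panicum mosaic virus CP", "1STM",
     "carbohydrate-binding module (R. flavefaciens)", "4D3L", 7.9),
    ("tobacco streak virus CP", "4Y6T", "Kex2 P domain", "1R64", 5.8),
    ("bacteriophage P23-77 MCP", "3ZMO",
     "nucleoplasmin/nucleophosmin", "2P1B", 5.2),
    ("satellite tobacco necrosis virus CP", None, "sTALL-1", None, 7.7),
    ("cowpea mosaic virus CP", "1NY7", "BclA", "3AB0", 5.4),
    ("AraC (E. coli)", "2ARC", "San Miguel sea lion virus CP", "2GH8", 2.7),
]

# Assumed rule-of-thumb cutoff; not a value taken from the paper.
Z_CUTOFF = 2.0

# Rank hits from strongest to weakest structural similarity.
ranked = sorted(REPORTED_HITS, key=lambda hit: hit[4], reverse=True)

for query, query_pdb, target, target_pdb, z in ranked:
    flag = "above cutoff" if z > Z_CUTOFF else "below cutoff"
    print(f"Z = {z:4.1f}  {query} ({query_pdb}) -> {target} ({target_pdb})  [{flag}]")
```

Ranked this way, the Cupin match sits well below the other five hits, in line with the description of Cupin SJR domains as more divergent.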

The ubiquity and functional diversity of cellular SJR proteins testify to their antiquity. Indeed, it is highly probable that cellular proteins with the SJR fold experienced substantial diversification before the emergence of the last universal cellular ancestor. Given the structural similarity between the cellular and viral SJR proteins and that some of these cellular proteins, such as the TNF superfamily, are capable of forming assemblies resembling virus-like particles (Fig. 2B), it is likely that the ancestor of the viral SJR CP evolved through recruitment of a cellular SJR protein. The original function of this protein could have involved recognition of carbohydrates. A protein with such a property would be immediately beneficial to the virus because, in addition to providing a protective shell for the genome, it could ensure specific binding of the viral particle to the host cell. It is noteworthy that many contemporary viruses bind directly to various glycan receptors on the surface of their hosts via the SJR CPs (47). The alternative possibility, that viral CPs gave rise to cellular SJR proteins, appears less likely, given the wide taxonomic distribution and functional diversity of SJR proteins in all three domains of cellular life, in sharp contrast to the scarcity of prokaryotic viruses with SJR CPs.

Transformation of a cellular protein into a bona fide CP would necessitate specific recognition and encapsidation of the viral genome. This function typically is performed by terminal extensions appended to the SJR core. For instance, some ssRNA and ssDNA viruses (e.g., tombusviruses and circoviruses, respectively) have largely unstructured, positively charged N-terminal domains that interact with the nucleic acids (48, 49).

Clustering of the SJR proteins with DALI, based on a pairwise comparison of the Z scores, suggests that the CPs from the majority of RNA viruses and eukaryotic ssDNA viruses form a monophyletic group (Fig. 2C and Dataset S1). Notably, circoviral CPs are nested among RNA viruses, as is consistent with the previously proposed scenario in which the CP genes of some eukaryotic ssDNA viruses have been horizontally acquired from ssRNA viruses (50–55). The compact CPs of bromoviruses (Fig. 2A) cluster with the P domain of Kex2-like subtilisin proteases, separately from other CPs (Fig. 2C and Dataset S1), whereas the CPs of bacterial microviruses (ssDNA genomes) appear to be more closely similar to the TNF-like proteins than to other viral CPs. The most divergent among viral SJR proteins, embellished with extended loops, are CPs of parvoviruses, polyomaviruses, and papillomaviruses. The CPs from the two latter virus groups form a clade separate from other viral CPs (Fig. 2C and Dataset S1). However, because of the high divergence of these proteins, their affinities are difficult to ascertain. Concurrently, among the cellular SJR proteins, cupins show the least similarity to other cellular and viral SJR proteins.
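To illustrate how a dendrogram can be derived from a matrix of pairwise Z scores, the sketch below implements plain average-linkage agglomerative clustering in which a higher Z score means greater similarity. It is a minimal illustration under stated assumptions, not a reproduction of DALI's own clustering procedure, which may differ. The four labels and all scores are hypothetical placeholders and are not values from Dataset S1.

```python
from itertools import combinations


def average_linkage(labels, z):
    """Agglomerative clustering on similarity scores.

    z maps frozenset({a, b}) to the pairwise Z score of structures a and b
    (higher means more similar). Returns the merges in the order performed,
    each as (cluster_1, cluster_2, mean pairwise Z between them).
    """
    clusters = [(name,) for name in labels]
    merges = []

    def similarity(c1, c2):
        scores = [z[frozenset((a, b))] for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Join the two clusters with the highest mean pairwise Z score.
        best = max(combinations(clusters, 2), key=lambda pair: similarity(*pair))
        merges.append((best[0], best[1], similarity(*best)))
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return merges


# Hypothetical placeholder structures and scores, for illustration only.
labels = ["viral_A", "viral_B", "cellular_C", "cellular_D"]
toy_z = {
    frozenset(("viral_A", "viral_B")): 12.0,
    frozenset(("cellular_C", "cellular_D")): 10.0,
    frozenset(("viral_A", "cellular_C")): 5.0,
    frozenset(("viral_A", "cellular_D")): 4.0,
    frozenset(("viral_B", "cellular_C")): 6.0,
    frozenset(("viral_B", "cellular_D")): 3.0,
}

merges = average_linkage(labels, toy_z)
for left, right, score in merges:
    print(f"merge {left} + {right} at mean Z = {score}")
```

In this toy input the two viral placeholders join first, and the two resulting clusters merge last at a much lower mean score. A viral protein whose closest neighbor is a cellular protein would instead join a cellular cluster before joining other viral proteins.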

These observations suggest that viral SJR CPs could have evolved from bona fide cellular proteins, possibly on several independent occasions. However, we may never be able to pinpoint the exact family(ies) of cellular SJR proteins at the origin of the viral CPs with confidence because of the high evolution rates in viral genomes.

Double jellyroll CPs.

The second largest architectural class includes viruses with the double jellyroll (DJR) MCP, which consists of two consecutive jellyroll domains and is found in ∼10% of the virus taxa (Fig. 1). Unlike SJR CPs, the DJR β-strands are oriented vertically with respect to the capsid surface (20, 21, 56). The DJR MCPs are exclusive to dsDNA viruses that are classified into 13 taxa and infect hosts from all three cellular domains (Table S1). This architectural class includes members of the bacterial virus families Corticoviridae and Tectiviridae, archaeal viruses of the family Turriviridae, and eukaryotic viruses of the families Adenoviridae and Lavidaviridae as well as the proposed order Megavirales that includes most of the large and giant eukaryotic viruses. Viruses with DJR MCPs are evolutionarily linked to a large group of unclassified eukaryotic endogenous viruses/transposons called “Polintoviruses/Polintons” (mavericks) that also encode a typical DJR protein, although the formation of virions remains to be demonstrated (57–59).

A straightforward evolutionary scenario proposes that the ancestral DJR MCP derives from a SJR CP via gene duplication (21, 60). Bacterial and archaeal viruses of the Sphaerolipoviridae family (61) display a potentially archaic virion architecture that might have given rise to the one observed in the DJR MCP viruses (62, 63). All sphaerolipoviruses encode two MCPs, each with the SJR fold, that form homo- and heterodimers involved in the formation of the icosahedral capsid with vertical orientation of the β-strands similar to the orientation in DJR MCPs. Consistent with the proposed SJR CP gene-duplication event in the evolution of the DJR ancestor, the two sphaerolipoviral CPs are most similar to each other among known protein structures (Fig. 2C and Dataset S1). Furthermore, sphaerolipoviruses and DJR viruses share genome-packaging ATPases of the A32-like family (named after the respective protein of vaccinia virus) (64) that thus far have not been found in viruses with other MCP types. Based on these common characteristics, we include sphaerolipoviruses in the architectural class of viruses encoding the DJR MCPs (Fig. 1). Notably, the two SJR CPs of sphaerolipoviruses cluster with nucleoplasmins/nucleophosmins, separately from other viral SJR CPs (Fig. 2C and Dataset S1), suggesting an independent origin from a cellular SJR ancestor.

HK97-like MCP fold.

The second major structural fold found in MCPs of dsDNA viruses is exemplified by and named after the gp5 protein of bacteriophage HK97 (65). This fold is characteristic of the MCPs of bacterial and archaeal members of the order Caudovirales (families Myoviridae, Siphoviridae, and Podoviridae) (66), one of the most abundant, widespread, and diverse groups of viruses on the planet (2, 3, 67). The HK97-fold is also found in the floor domain of the MCP of herpesviruses (order Herpesvirales) (68). In addition to homologous MCPs, herpesviruses and tailed prokaryotic dsDNA viruses share closely similar mechanisms of virion assembly, maturation, and genome packaging, indicating that at least the morphogenetic modules of the two groups evolved from a common ancestor (23, 56, 68, 69). Outside the virosphere, the HK97-like fold is found only in encapsulins, a class of bacterial and archaeal nanocompartments that encapsulate a variety of cargo proteins related to oxidative stress response, including ferritin-like proteins and DyP-type peroxidases (19). High-resolution structures are available for three encapsulins (Fig. S1), which, similar to viruses, assemble into icosahedral T = 1 or T = 3 cages (70–72). Structural comparison of the available cellular and viral proteins with the HK97-like fold showed that bacterial and archaeal encapsulins form a tight, apparently monophyletic cluster, whereas viral MCPs are more divergent (Fig. S1). This observation, along with the ubiquity of tailed dsDNA viruses, as opposed to the more narrow spread of encapsulins, might be interpreted as an indication of the viral origin of encapsulins via domestication of the HK97-like MCP. Notably, however, encapsulins are also encoded in certain groups of archaea, namely the phylum Crenarchaeota (73), which are not known to be parasitized by members of the Caudovirales.
Thus, given our limited knowledge about the structural diversity and taxonomic distribution of encapsulins, the exact evolutionary relationship between encapsulins and MCPs remains unresolved.

Fig. S1. Comparisons of cellular and viral proteins with the HK97-like fold. (A) Matrix based on the pairwise comparison of Z scores calculated using DALI. The color scale indicates the corresponding Z scores. (B) A collection of structures of encapsulins (E, Upper Row) and major capsid proteins of members of the order Caudovirales (C, Lower Rows). All structures are colored using the rainbow scheme from blue (N terminus) to red (C terminus), and the corresponding PDB identifiers are shown.

Chymotrypsin-like protease fold.

Viruses of the genus Alphavirus (family Togaviridae) present another example of a CP that is evolutionarily related to cellular proteins. It has been noticed previously that alphavirus core (C) protein shares sequence similarity with chymotrypsin-like serine proteases (74, 75). The C protein consists of two domains: the largely unstructured, positively charged N-terminal region responsible for RNA binding and the C-terminal protease domain, which forms the icosahedral capsid shell located under the glycoprotein-containing envelope. In addition to its structural role in capsid formation, the protein acts as a protease and cleaves off the C protein from the polyprotein precursor. Following cleavage, the C terminus of the C protein inhibits its own protease activity (76). Structural studies have unequivocally shown that the C protein adopts the chymotrypsin-like fold (76). Strikingly, the closest homologs of alphaviral C protein (PDB ID code: 1WYK) are encoded by members of the family Flaviviridae (77, 78). The protease NS3 of Hepatitis C virus (PDB ID code: 1RGQ) is recovered with a DALI Z score of 10.8, and the protease HtrA from humans (PDB ID code: 3NWU) follows with a Z score of 10.6 (Fig. S2). The NS3 protease does not play a structural role in virion formation of flaviviruses but rather is responsible for the proteolytic processing of the polyprotein at several sites to produce mature viral proteins (77). In addition to the related proteases, flaviviruses and alphaviruses encode homologous class II envelope glycoproteins, which form the icosahedral shells around the membrane (79). However, unlike alphaviruses, flaviviruses do not form internal icosahedral capsids, and their C protein has a unique α-helical fold (80). The parsimonious scenario for the origin of the alphaviral C protein includes partial refunctionalization of the viral nonstructural protease, such as the flavivirus NS3, which itself evolved from the HtrA-like cellular protease. 
The key adaptation in this process was the addition of a positively charged N-terminal region to the protease domain, enabling the protein to bind viral RNA. Another notable difference between C protein and NS3 (and cellular proteases) is the absence of a conserved C-terminal α-helix in the C protein (Fig. S2). Remarkably, it has been recently demonstrated that the C protein of alphaviruses is not essential for virion formation (81). Deletion of the C gene results in the production of infectious, pleomorphic membrane vesicles decorated with viral glycoproteins and carrying the viral genome. This observation implies that the C protein is a relatively recent elaboration in alphaviral virions that is not central for virus propagation.

Fig. S2. Comparison of the alphaviral capsid protein (Left) with the nonstructural protease NS3 of hepatitis C virus (HCV) (Center) and human chymotrypsin-like protease HtrA1 (Right). All structures are colored using the rainbow scheme from blue (N terminus) to red (C terminus), and the corresponding PDB identifiers are shown.

Helical nucleocapsids.

Strikingly, in the course of evolution, endonucleases appear to have been recruited to function as viral nucleocapsid proteins. The racket-shaped NC protein of nairoviruses (proposed family Nairoviridae) contains head and stalk domains (Fig. S3). It has been shown that the head domain of the Crimean-Congo hemorrhagic fever virus (CCHFV) nucleocapsid has a metal-dependent DNA-specific endonuclease activity (82). However, the protein does not display any recognizable similarity to known cellular nucleases. The NC protein of arenaviruses, such as Lassa mammarenavirus (LASV), which contains a head domain similar to that of nairoviruses (Fig. S3), instead displays a dsRNA-specific 3′–5′ exonuclease activity (83, 84). In the latter case, the activity is conferred not by the head domain but by the dedicated C-terminal domain homologous to various exonucleases of the DEDDh superfamily (named after four invariant acidic residues, DEDD, in the active site) (Fig. S3). It seems highly probable that the arenaviral nucleocapsid evolved from a nairoviral-like ancestor by acquiring the host-derived exonuclease domain.

Fig. S3. Comparison of the NC proteins from Crimean-Congo hemorrhagic fever virus (CCHFV) and Lassa mammarenavirus (LASV) and cellular exonucleases of the DEDDh superfamily. DNAP III exo is the proofreading exonuclease subunit of E. coli DNA polymerase III. The PDB identifiers of all structures are shown. The exonuclease domains are colored using the rainbow scheme from blue (N terminus) to red (C terminus).

In a case from a completely different part of the virosphere, evolution of a viral nucleocapsid from a nuclease has also recently been demonstrated for the enveloped filamentous archaeal virus Thermoproteus tenax virus 1 (TTV1). Sequence analysis suggests that one of the two major NC proteins of TTV1 is a truncated and inactivated derivative of the CRISPR-associated nuclease Cas4, a component of adaptive CRISPR-Cas immune systems (85). Thus, it appears that during virus evolution cellular proteins involved in nucleic acid metabolism, nucleases in particular, have been recruited to function as structural components of the virion on several independent occasions.

Retroviral Gag polyprotein.

In all retroviruses, the structural polyprotein, group-specific antigen (Gag), is proteolytically processed into matrix (MA), capsid (CA), and NC proteins (Fig. 3A), but some viruses contain additional domains, such as p6 in HIV-1 (86). The MA is typically myristoylated at the N terminus and is required for Gag transport and subsequent binding to the cytoplasmic membrane; the CA and NC are both required for Gag multimerization and for the formation of immature spherical particles (87). Several high-resolution structures are available for all three major Gag domains from various retroviruses. Analysis of the retroviral MA structures reveals an α-helical fold that is remarkably similar to that of the N-terminal HTH DNA-binding domain found in various integrases of the tyrosine recombinase superfamily. A DALI search seeded with the MA of mouse mammary tumor virus (MMTV) (PDB ID code: 4ZV5) resulted in a significant hit (Z score, 4.8) to the HTH domain from the integron integrase of Vibrio cholerae (PDB ID code: 2A3V) (Fig. 3B). Notably, upon virus entry into the host cell and following reverse transcription, HIV-1 MA becomes a component of the preintegration complex and binds to dsDNA (88). Accordingly, the retroviral matrix displays not only structural but also functional similarity to the DNA-binding HTH domains and in all likelihood was exapted from this source.

Fig. 3. Cellular homologs of the retroviral proteins constituting the Gag polyprotein. (A) Proteolytic processing of the retroviral Gag polyprotein into MA, CA, and NC proteins. (B) Structural comparison of the matrix protein (Upper) of mouse mammary tumor virus (MMTV) with the N-terminal DNA-binding domain (Lower) of the tyrosine recombinase of V. cholerae. (C) Structural comparison of a dimer of the CA C-terminal domain (CA-CTD) (Upper) of HIV-1 with the dimer of the human SCAN domain protein (Lower). (D) Structural comparison of the NC protein (Upper) of HIV-1 with the human pluripotency factor Lin28 (Lower). All structures are colored using the rainbow scheme from blue (N terminus) to red (C terminus), and the corresponding PDB identifiers are shown.

Within the virosphere, the N-terminal domain of the CA protein appears to be unique to reverse-transcribing viruses. By contrast, a domain homologous to the C-terminal domain of the CA protein is commonly found in vertebrate transcription factors and is known as the “SCAN domain” (PF02023), a protein-interaction module that mediates self-association or selective association with other proteins (89). The SCAN domain is always accompanied by multiple C2H2 zinc fingers and/or Krüppel-associated box (KRAB) domains, none of which are of retroviral origin (90). The crystal structure of the SCAN dimer from the human ZNF174 protein indicates that this protein is indeed a domain-swapped homolog of the C-terminal domain of the retroviral CA protein (Fig. 3C) (91). It was previously concluded that known SCAN domains have been recruited from retrotransposons at or near the root of the tetrapod animal branch (89, 90). However, given the generic utility of the SCAN domain for protein dimerization in both viruses and hosts (for functions unrelated to virion formation), the exact provenance of the ancestral SCAN-like dimerization domain remains uncertain.

The NC protein contains one or two CCHC Zn-knuckle motifs and binds the viral genome (87). An HHpred analysis of the NC protein sequence from HIV-1 showed that it is closely related to other Zn-knuckle domain proteins, most notably pluripotency factor Lin28 (probability = 99.3%) and Air2p, a substrate recognition component of a polyA RNA polymerase (probability = 99.2%). Comparison of the HIV-1 NC protein (PDB ID code: 1A1T) and human Lin28 (PDB ID code: 2LI8) further underscores the close structural similarity between the two proteins (Fig. 3D). Thus, at least two of the three major building blocks of retroviral virions are likely to have evolved from cellular proteins.

MA protein of arenaviruses.

The matrix protein, Z, of arenaviruses performs multiple functions, one of which is to bridge the viral surface glycoprotein, the viral ribonucleoprotein, and the host cell budding machinery (92). Similar to the retroviral MA, the N terminus of the Z protein is myristoylated, facilitating its membrane anchoring and intracellular targeting, self-assembly, and interaction with other viral proteins. Structural studies have shown that the Z protein of Lassa mammarenavirus (LASV) contains a typical Zn-binding RING domain (93), and an HHpred search retrieves with high probability (99.1%) E3 ubiquitin ligases and other cellular RING domain proteins as close homologs of the LASV Z (Fig. S4). Pairwise structural comparison of the LASV Z protein with the RING domain of human E3 ubiquitin ligase (PDB ID code: 4V3L) using DALI returned a Z score of 4.3. Notably, BLASTP searches seeded with the LASV Z protein yielded multiple significant hits to cellular multidomain proteins. For instance, a protein from plants (XP_018837173) was retrieved with an E value of 8e-05 and showed 37% identity to the LASV Z. Considering that among viral matrix proteins the RING domain is restricted to arenaviruses but otherwise is widespread in eukaryotic proteins, it is highly probable that a cellular RING domain protein was exapted as the matrix protein in the ancestor of arenaviruses, following or concomitant with its divergence from bunyaviruses.

Fig. S4. Comparison of the matrix protein Z of Lassa mammarenavirus (LASV) and the RING domain of ubiquitin (Ub) ligase E3. Both structures are colored using the rainbow scheme from blue (N terminus) to red (C terminus), and the corresponding PDB identifiers are shown.

Matrix proteins of mononegaviruses.

Like many other enveloped viruses, members of the order Mononegavirales encode matrix proteins that direct virion assembly and budding. Structures of the matrix proteins are available for mononegaviruses of the families Filoviridae, Bornaviridae, Paramyxoviridae, Pneumoviridae, and Rhabdoviridae. The rhabdovirus matrix protein has a unique fold and is unrelated to the matrix proteins of other mononegaviruses (94). The latter are homologous to each other and consist of one (bornaviruses) or two (filoviruses, paramyxoviruses, and pneumoviruses) domains with similar β-sandwich folds, suggesting gene duplication during evolution (95, 96). Analysis of the mononegaviral (except for rhabdoviral) matrix protein structures uncovered unexpected similarity to cyclophilins. Cyclophilins are ubiquitous cellular proteins that possess peptidyl-prolyl-isomerase activity and participate in protein folding; these proteins also are receptors for the immunosuppressive drug cyclosporin A, which gave them their name (97). Fig. 4 shows a comparison between the N-terminal domain of the Ebola virus (EBOV) matrix protein and cyclophilin C (CypC). The match between the EBOV matrix protein and CypC was obtained with a low but significant Z score of 2.3. It should be noted that, in the same search seeded with the EBOV matrix protein, homologs from other mononegaviruses were obtained with similarly low Z scores. For instance, matrix proteins from human respiratory syncytial virus (family Pneumoviridae) (PDB ID code: 2VQP) and Borna disease virus (Bornaviridae) (PDB ID code: 3F1J) were matched to the EBOV matrix protein with Z scores of 3.5. Moreover, visual inspection of the matrix and CypC proteins further supported the validity of the DALI matches. The main difference between the EBOV matrix protein and CypC is the presence of an additional β-hairpin in the structure of the latter protein (Fig. 4).
Matches to cyclophilins were also obtained when DALI searches were seeded with structures of matrix proteins from other mononegaviruses (Z scores of 2.5–3.1). Notably, cyclophilins are known to play an important role in viral infections (98). In particular, cyclophilin A (CypA) is incorporated into virions by binding to capsids or nucleocapsids of many unrelated viruses, including HIV-1 (Retroviridae), vesicular stomatitis virus (Rhabdoviridae), vaccinia virus (Poxviridae), and severe acute respiratory syndrome coronavirus (SARS-CoV; Coronaviridae) (98). All members of the Mononegavirales, including rhabdoviruses, encode homologous NC proteins (99), the unrelated matrix proteins notwithstanding. Binding of CypA to the nucleocapsid of rhabdoviruses (100) resembles the interaction between the nucleocapsid and the matrix protein of other mononegaviruses (101). Thus, the matrix protein-encoding gene of mononegaviruses likely evolved from a cyclophilin gene acquired from the host. The alternative possibility, i.e., that the matrix protein of mononegaviruses is at the origin of cyclophilins, is implausible, given the ubiquity of cyclophilins in cellular organisms and their scarcity among viruses. The cellular cyclophilin that gave rise to the matrix protein might have interacted with the viral nucleocapsid, as in the case of rhabdoviruses. Given that, in RNA-dependent RNA polymerase (RdRp)-based phylogenies, rhabdoviruses do not occupy the basal position within the order Mononegavirales (102), it is likely that the ancestral cyclophilin-like matrix protein-encoding gene was replaced in the rhabdovirus ancestor by a nonhomologous gene with similar properties. It would be interesting to test whether any of the matrix proteins of mononegaviruses retained the peptidyl-prolyl-isomerase activity typical of cyclophilins.

Fig. 4. Structural comparison of the Ebola virus MA protein with CypC. Topology diagrams (Left) and structural models (Right) are colored using the rainbow scheme from blue (N terminus) to red (C terminus), and the corresponding PDB identifiers are shown. The β-hairpin insert in CypC is colored black.

Given the structural similarities between other cellular proteins and viral CPs, a scenario emerges in which bona fide viruses evolved on multiple, independent occasions by recruiting diverse host proteins that became major virion components (Fig. 5).

Fig. 5. A scenario for the origin of viruses from selfish replicators upon acquisition of capsid protein genes from cellular life forms at different stages of evolution.

Concluding Remarks

The findings on the apparent independent recruitment of diverse proteins from cellular organisms for the role of CPs and other major virion proteins compel us to adjust our concept of the virus world (4, 6). The grand scenario for virus evolution becomes a hybrid between the virus-first and escape hypotheses (Fig. 5). Given the lack of close cellular homologs for the hallmark virus proteins involved in genome replication, the diversity of viral genomic strategies, and the general considerations on the early stages in the evolution of replicating genomes that implicate ensembles of small, partially autonomous, virus-like genetic elements (103), the origin of viral replicative modules seems likely to hark all the way back to the precellular era. At that stage, some of these primordial replicators coalesced and gave rise to the first cellular genomes, whereas others became genetic parasites. Conceivably, however, such parasites gave rise to true viruses only after the emergence of cells. Viruses emerged through the recruitment of cellular carbohydrate- or nucleic acid-binding proteins as CPs and other major virion proteins. Given the simple, symmetrical, thermodynamically favored structures of the widespread capsids, such as icosahedra or helices, the structural requirements for such exaptation might not have been prohibitive. Indeed, this view is compatible with the scenario of multiple recruitment events occurring throughout the course of the evolution of life. Some of the CPs were coopted at the earliest stages of cellular evolution, as is likely to have been the case for the SJR and DJR folds. Other structural proteins were likely recruited at the root of particular cellular domains, as is likely the case for retroviral Gag, given the broad distribution and diversity of retroviruses in Eukarya, in sharp contrast to their absence in bacteria and archaea (11).
Finally, in all likelihood, some virion components have evolved rather recently, e.g., protease recruitment for the alphavirus capsids or the RING domain and cyclophilin exaptation as the matrix proteins of arenaviruses and mononegaviruses, respectively. Notably, virus-like structures also appear to have evolved in the cellular context, e.g., bacterial microcompartments, large icosahedral organelles that, unlike encapsulins, are built from proteins with no identifiable homologs in the viral world (18). The evolution of virions certainly is not a one-way street. Once multiple capsids evolved, viral structural proteins were recruited for cellular functions on multiple occasions. A well-known example is the exaptation of retroviral envelope proteins for the role of mammalian placental receptors, syncytins (104). The history of virions can be considered the ultimate manifestation of virus–host coevolution.

Materials and Methods

Sequence and Structural Data.

To analyze the diversity of folds in the major virion proteins, structural and sequence information for representative proteins was collected from all currently recognized viral families and unassigned genera (25). The list of approved virus taxa was downloaded from the International Committee on the Taxonomy of Viruses (ICTV) website (https://talk.ictvonline.org/files/master-species-lists/). In accordance with the recently proposed taxonomy, which has been approved by the Executive Committee of the ICTV (105), different genera within the family Bunyaviridae were considered as separate families. In total, the analyzed dataset covered 135 virus taxa (117 families and 18 unassigned genera). The genera Dinodnavirus and Rhizidiovirus were excluded from the analysis because of the complete lack of available sequence or structural information. The protein structures and sequences were downloaded from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (www.rcsb.org) and the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/protein/), respectively.
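As an illustration of the retrieval step, the following minimal Python sketch maps the PDB ID codes cited in this article to download locations. It is not the procedure used in this study: the helper name pdb_url and the constant RCSB_DOWNLOAD_BASE are hypothetical, and the URL layout (files.rcsb.org/download/<ID>.pdb) is an assumption about the public RCSB file service that should be checked against the current RCSB documentation. The sketch only builds the URLs and performs no network access.

```python
# Sketch only: build RCSB download URLs for PDB ID codes cited in the text.
# The URL layout below is an assumption, not something stated in the article.
RCSB_DOWNLOAD_BASE = "https://files.rcsb.org/download"

# PDB ID codes quoted in the Results section of this article.
PDB_IDS = ["4ZV5", "2A3V", "1A1T", "2LI8", "4V3L", "2VQP", "3F1J"]


def pdb_url(pdb_id: str) -> str:
    """Return the assumed download URL for a four-character PDB ID code."""
    cleaned = pdb_id.strip().upper()
    # Classic PDB ID codes are four alphanumeric characters.
    if len(cleaned) != 4 or not cleaned.isalnum():
        raise ValueError(f"not a four-character PDB ID code: {pdb_id!r}")
    return f"{RCSB_DOWNLOAD_BASE}/{cleaned}.pdb"


# Map each cited ID to its assumed download location.
urls = {pid: pdb_url(pid) for pid in PDB_IDS}

if __name__ == "__main__":
    for pid, url in urls.items():
        print(pid, url)
```

The resulting URLs could be passed to any HTTP client; validating the ID format first avoids requesting malformed paths.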

Sequence and Structure Analysis.

Structure-based searches were performed using the DALI server (34, 106). Structural similarities between cellular and viral proteins were evaluated based on the DALI Z score, which is a measure of the quality of the structural alignment. Z scores above 2, i.e., two SDs above expected, are usually considered significant (107). The relevance of the matches was evaluated further by visual inspection of structural alignments between the cellular and viral proteins. Structural homologs were additionally searched for using the TopSearch server (https://topsearch.services.came.sbg.ac.at/). Structural similarity matrices from all-against-all structure comparisons as well as corresponding dendrograms were obtained using the latest release of the DALI server (34). Structures were aligned using the MatchMaker algorithm implemented in University of California, San Francisco (UCSF) Chimera (108) and were visualized using the same software. Sequence-similarity searches were performed using PSI-BLAST (109) against the nonredundant protein sequence database at the NCBI. For distant sequence similarity detection, homologous sequences of viral proteins were aligned using MUSCLE (110), and the resulting multiple sequence alignments (or individual sequences) were used as seeds in profile-against-profile searches using HHpred (111).
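The Z score criterion described above can be sketched as a simple filter. The following Python example is illustrative only and does not query the DALI server: the class DaliHit, the function candidate_hits, and the constant Z_THRESHOLD are hypothetical names, and the listed hits are transcribed from the Z scores quoted in this article. Passing the threshold marks a match only as a candidate, because the relevance of each match was evaluated further by visual inspection.

```python
from dataclasses import dataclass

# Z scores above 2 are usually considered significant (see text).
Z_THRESHOLD = 2.0


@dataclass(frozen=True)
class DaliHit:
    """One structural match; a hypothetical record type for this sketch."""
    query: str
    match: str
    z_score: float


# Values transcribed from the Z scores reported in this article.
HITS = [
    DaliHit("MMTV MA (4ZV5)", "V. cholerae integron integrase HTH (2A3V)", 4.8),
    DaliHit("LASV Z", "human E3 ubiquitin ligase RING domain (4V3L)", 4.3),
    DaliHit("EBOV matrix", "cyclophilin C", 2.3),
    DaliHit("EBOV matrix", "HRSV matrix (2VQP)", 3.5),
    DaliHit("EBOV matrix", "Borna disease virus matrix (3F1J)", 3.5),
]


def candidate_hits(hits, threshold=Z_THRESHOLD):
    """Keep hits with a Z score strictly above the threshold, best first."""
    kept = [h for h in hits if h.z_score > threshold]
    return sorted(kept, key=lambda h: h.z_score, reverse=True)


if __name__ == "__main__":
    for hit in candidate_hits(HITS):
        print(f"{hit.z_score:4.1f}  {hit.query} -> {hit.match}")
```

Raising the threshold in this sketch, for example to 3.0, would drop the EBOV matrix and CypC match, which is why low-scoring matches were compared against the scores obtained for known viral homologs in the same search.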

Footnotes

1To whom correspondence may be addressed. Email: koonin{at}ncbi.nlm.nih.gov or krupovic{at}pasteur.fr.

Author contributions: M.K. and E.V.K. designed research; M.K. performed research; M.K. and E.V.K. analyzed data; and M.K. and E.V.K. wrote the paper.

Reviewers: C.M.L., Montana State University; and K.S., Portland State University.