Abstract

Current understanding of the higher order systematics of eukaryotes relies largely on analyses of the small ribosomal subunit RNA (SSU rRNA). Independent testing of these results is still limited. We have combined the sequences of four of the most broadly taxonomically sampled proteins available to create a data set roughly parallel to that of SSU rRNA. The resulting phylogenetic tree shows a number of striking differences from SSU rRNA phylogeny, including strong support for most major groups and several major supergroups.

SSU rRNA sequences constitute the single most comprehensive database available for phylum-level systematics (1–4). These data depict the eukaryotes as a series of deeply diverging lineages branching successively toward a dense unresolved cluster [the so-called eukaryote crown (5)]. Because this cluster includes the majority of eukaryotes, this has led to suggestions that most major eukaryote taxa arose in a single explosive radiation (5, 6), and, together with poor resolution in many protein-based phylogenies, to speculation that relationships among these taxa may never be resolved (6). Although phylogenies of protein genes and rRNAs often conflict, currently available protein data are plagued by uneven taxonomic sampling, wide disparities in evolutionary rates among lineages, and/or inadequate characterization. The first two phenomena tend to cause the artifactual clustering of long branches (7), whereas the last may result in failure to detect taxonomically confounding gene paralogies or lateral gene transfers.

We attempted to overcome these problems and to create a well-resolved phylogeny parallel to that of SSU rRNA by combining the deduced amino acid sequences of four protein-encoding genes. The encoded proteins, α-tubulin, β-tubulin, actin, and elongation factor 1–alpha (EF-1α), are the only proteins currently available with sufficient length, breadth of sampling, and level of sequence conservation to test ancient evolutionary relationships (8). These proteins are all ∼400 amino acids long, with ≥65 to 70% identity among all taxa, and are therefore expected to contribute similarly in combined analyses.

All taxa with sequence data for a minimum of three of these proteins were included, plus several key taxa with sequence data from only one or two proteins (9). Major groups represented by only a single taxon or set of closely related taxa were excluded, except for the glaucophyte and rhodophyte algae because of current interest in these taxa (10). Including a substantial number of missing data entries limited the choice of phylogenetic method used (11) to unweighted parsimony (12), with results confirmed by nucleotide-level maximum likelihood (13). The accuracy of amino acid parsimony with these data is shown by previous analyses excluding taxa with missing entries, where neighbor-joining distance and protein maximum likelihood gave results equivalent to those of parsimony (8).
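The assembly of a combined matrix with missing entries can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name `concatenate`, the `?` missing-data symbol, and the toy sequences are assumptions, and the `min_present` filter models only the "minimum of three proteins" rule (the key taxa admitted with one or two proteins would need to be whitelisted separately).

```python
from typing import Dict, List

MISSING = "?"  # assumed missing-data symbol


def concatenate(partitions: Dict[str, Dict[str, str]],
                order: List[str],
                min_present: int = 3) -> Dict[str, str]:
    """Join per-protein alignments end to end, padding absent proteins."""
    # Each partition must be an alignment, i.e. all sequences of equal length.
    lengths = {}
    for name in order:
        seq_lengths = {len(s) for s in partitions[name].values()}
        if len(seq_lengths) != 1:
            raise ValueError(f"partition {name!r} is not aligned")
        lengths[name] = seq_lengths.pop()

    taxa = sorted({t for name in order for t in partitions[name]})
    matrix = {}
    for taxon in taxa:
        present = sum(taxon in partitions[name] for name in order)
        if present < min_present:
            continue  # too little data for this taxon
        matrix[taxon] = "".join(
            partitions[name].get(taxon, MISSING * lengths[name])
            for name in order
        )
    return matrix


# Toy data only; E, C, A, B follow the protein letters used in the figures.
toy = {
    "E": {"t1": "MKV", "t2": "MRV", "t3": "MKI"},
    "C": {"t1": "AD", "t2": "AE"},
    "A": {"t1": "GG", "t2": "GA", "t3": "GG"},
    "B": {"t1": "W", "t3": "W"},
}
combined = concatenate(toy, ["E", "C", "A", "B"])
```

A taxon lacking one protein keeps its row, with that protein's columns filled by the missing-data symbol, so the remaining three proteins still contribute characters.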

Phylogenetic analysis of the concatenated, deduced amino acid sequences of four protein-encoding genes produces a highly resolved phylogenetic tree including 14 higher order eukaryote taxa (Figs. 1 and 2A). Forty-nine of a total of 58 nodes receive bootstrap percentage (BP) support of 75% or greater at the amino acid level [∼95% probable accuracy (14)], slightly lower in some cases at the nucleotide level (Fig. 1). Of the 11 higher order taxa with multiple representatives, all are reconstructed as monophyletic; all but one are supported by >89% bootstrap at the amino acid level (aaBP, Fig. 1) and all but two by >85% bootstrap at the nucleotide level (ntBP, Fig. 1). Only the ciliates receive a relatively low aaBP of 70% (58% ntBP, Fig. 1), likely due to their fast evolutionary rates for both EF-1α (15) and actin (16). Of these 11 major taxa, only the Fungi + Microsporidia and the Mycetozoa are controversial.
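For readers unfamiliar with bootstrap percentages, the resampling step behind a BP value can be sketched as below. This is a generic illustration under stated assumptions: the function names are hypothetical, the tree search itself (parsimony or likelihood) is not implemented, and the `recovered` list stands in for the outcome of asking, for each replicate tree, whether a given group appears.

```python
import random
from typing import Dict, List


def bootstrap_replicate(alignment: Dict[str, str],
                        rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement, keeping rows in register."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}


def support_percent(recovered: List[bool]) -> float:
    """Percentage of replicate trees in which a group was recovered."""
    return 100.0 * sum(recovered) / len(recovered)
```

Each replicate has the same taxa and the same number of sites as the original matrix; only the sampling of columns differs. A node found in 3 of 4 replicate trees would score 75%.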

Fig. 1. A kingdom-level phylogeny of eukaryotes, based on combined protein sequences. The tree shown is one of two shortest trees found by parsimony analysis of concatenated EF-1α, actin, α-tubulin, and β-tubulin amino acid sequences (44). The tree is 5056 steps long with branches drawn to scale as indicated (43, 45). Bootstrap values >50% are shown above and below the lines, respectively, for amino acid parsimony (aaBP) and maximum likelihood analyses of second codon–position nucleotides (ntBP). Parentheses indicate the aaBP for the grouping of animals + fungi plus lobosa + mycetozoa in analyses omitting Bangiophyceae and Cyanophora (see text). Dashes (–) below lines indicate nodes not tested in the ntBP analyses shown [Bangiophyceae, Cyanophora, and Acanthamoeba omitted; see text (29)]. For taxa with missing data, the sequences used are indicated in brackets to the right of taxon names in uppercase and lowercase letters for complete and partial sequences, respectively (E = EF-1α, C = actin, A = α-tubulin, B = β-tubulin). The lowest common taxonomic designation is given for sequences combined from different taxa. The shortest trees differ only in their placement of Pneumocystis, as shown by the thin dashed line; all other slanting dashed lines indicate alternative groupings found with ntBP >50%. The horizontal dashed line (left center) indicates tentative placement of the Diplomonadida and Parabasalia (46).

The Microsporidia were long classified as early-branching eukaryotes on the basis of SSU rRNA trees [e.g., (2–4)], a placement further supported by early analyses of EF-1α, elongation factor 2 (EF-2), and large subunit (LSU) rRNA [reviewed in (17)]. However, the strong placement of these taxa with fungi in our tree (95% aaBP, 85% ntBP, Fig. 1; Fig. 2A, node 2) is also found in trees of α-tubulin, β-tubulin, RNA polymerase II largest subunit (RPB1), valyl-tRNA synthetase, and the TATA box binding protein (Fig. 2B) (17). Microsporidial EF-1α sequences also encode an insertion diagnostic of animals and fungi (8, 17, 18). Reanalyses of EF-1α, EF-2, and SSU and LSU rRNA suggest that the early branching of Microsporidia in these trees is an artifact of their accelerated evolutionary rates for these genes [reviewed in (17)].

The long-standing controversy over the monophyly of the Mycetozoa (Fig. 2A, node 5) has been fueled by the failure of these taxa to branch together in most SSU [(3, 4), but see (2)] and LSU (19) rRNA trees. However, the strong placement of these taxa together in our tree (90% aaBP, 100% ntBP, Fig. 1) is also seen with α-tubulin, β-tubulin, actin, and EF-1α [with the latter case also including protostelid slime molds (Fig. 2B) (20)]. All Mycetozoa also produce morphologically similar, quasi-multicellular fruiting bodies (21). Therefore, the apparent unrelatedness of myxogastrid (plasmodial, e.g., Physarum) and cellular (e.g., Dictyostelium) slime mold rRNA sequences (Fig. 2B) may be an artifact of their fast evolutionary rates and oppositely skewed G+C nucleotide contents (3).

It is at the “super-taxon” level especially that the combined data show markedly greater resolution than any single-gene phylogeny (Figs. 1 and 2B). This includes a new protistan supertaxon of Euglenozoa + Heterolobosea (the Discicristata). Four other higher level associations receive >86% aaBP support (>85% ntBP, Fig. 1): Animalia + Fungi (opisthokonts), Ciliophora + Apicomplexa (Alveolata), Mycetozoa + Lobosa (Amoebozoa), and an even higher level clustering of Amoebozoa + opisthokonts. Two further supertaxa are suggested with moderate or weak support, respectively: Alveolata + Heterokonta (chromalveolates) and a holophyletic Plantae (Fig. 1). In contrast, most individual molecules, including SSU rRNA, reconstruct fewer than half of these groups (Fig. 2B).

Combined protein data strongly support a cluster of Euglenozoa + Heterolobosea (81% aaBP, 89% ntBP, Fig. 1; Fig. 2A, node 18). The only previous molecular phylogenetic support for this group comes from analyses of β-tubulin, particularly with limited taxa and semi-constrained branches (22). However, other data are suggestive; these taxa tend to branch near each other in trees of α-tubulin and SSU rRNA (Fig. 2B) and, at least in the case of Acrasis, EF-1α (18). A possible close evolutionary relationship between euglenozoans and heteroloboseans was proposed by Patterson and others on the basis of their shared possession of discoidally shaped mitochondrial cristae (23), hence the term “discicristates,” later formalized to Discicristata (24). However, this mitochondrial morphology is reported in several enigmatic protists, i.e., Malawimonas, Nuclearia, Stephanopogon, and possibly Ministeria (23–25), for which there are no published molecular data.

A grouping of Mycetozoa + the lobose amoeba Acanthamoeba (Amoebozoa, Fig. 2A, node 6) is strongly supported by these data (86% aaBP, Fig. 1), by trees of actin (Fig. 2B) and the actin-related proteins ARP2 and ARP3 (27), and by mitochondrial genome similarities in Dictyostelium and Acanthamoeba (28). Morphologically, these taxa share amoeboid stages with lobose pseudopodia moving in a smooth, noneruptive manner (21, 23). Patterson denotes these taxa the “ramicristates,” on the basis of shared mitochondrial morphology (23).

Combined protein data further place the amoebozoans as the sister group to the opisthokonts (Fig. 2A, node 7). With deletion of the nearby branches of Bangiophyceae and Cyanophora, this cluster receives 99% aaBP (85% ntBP, Fig. 1) (29). Thus, these data strongly place the amoebozoans as a closer sister group to the opisthokonts than all other eukaryotes examined here, with the possible exception of glaucophytes and/or rhodophytes. A close relationship between amoebozoans and opisthokonts is also seen with actin and summed maximum likelihood scores (Dictyostelium + opisthokonts), and is suggested by EF-1α and possibly tubulins [reviewed in (8)]. SSU rRNA data also consistently place the lobosans, at least, very close to opisthokonts (1–4).

One problematic taxon here is the tentatively supported Plantae (Fig. 2A, node 10, and Fig. 1). This group includes the three lines of primary photosynthetic eukaryotes (Rhodophyta, Glaucophyta, and Viridiplantae), all other algae having acquired their plastids second-hand from these [reviewed in (30)]. A holophyletic Plantae, allowing for a single origin of eukaryotic photosynthesis, is well supported by a large body of organelle data [reviewed in (30)]. While confirmation from nuclear gene data is still lacking, especially for glaucophytes, at least two nuclear genes support a red-green plant clade [vacuolar adenosine triphosphatase (V/A-ATPase) and EF-2 (Fig. 2B)], and previous rejection of this group by RPB1 is now questioned [reviewed in (10)]. Because rhodophytes and glaucophytes are represented here by only a single taxon each (Fig. 1), increased taxonomic sampling may improve their resolution with these data.

The ciliates + apicomplexans (alveolates, Fig. 2A, node 14) are a widely accepted taxon and are well supported here (86% aaBP, 90% ntBP, Fig. 1), as well as in trees of HSP70, α-tubulin, β-tubulin, SSU rRNA, and LSU rRNA (Fig. 2B); the latter two genes also clearly place the dinoflagellates in this group (1–4). Alveolates possess cortical alveoli or related structures, systems of membrane-bound sacs lying beneath the plasma membrane and performing structural roles or giving rise to external coverings such as pellicles [ciliates (21)] or thecal plates [dinoflagellates (21)]. The strength of this group in our tree (Fig. 1) is particularly striking, because individual trees of actin and EF-1α give notoriously poor resolution of these taxa (see above) (15, 16).

The clustering of alveolates + heterokonts (chromalveolates, Fig. 2A, node 15) suggested by these data (61% aaBP, 53% ntBP, Fig. 1; see below) is well supported by trees of HSP70 and SSU rRNA “crown” taxa, and is weakly supported by β-tubulin (Fig. 2B). Although this grouping appears relatively weak in our tree (Fig. 1), the heterokonts are also highly incomplete for these data (Fig. 1), and partitioned analyses suggest that this grouping here may be robust (see below). Cavalier-Smith designates this group the “chromalveolates,” also including in it the haptophyte and cryptophyte algae (31). This grouping would allow for a single gain, by secondary (eukaryote-to-eukaryote) endosymbiosis, of all chlorophyll c–containing plastids plus the plastid-like organelle of apicomplexans (31). However, there are many fundamental differences between these plastids (30, 31) and, currently, little molecular data on haptophytes and/or cryptophytes to test this question.

Combining data should increase phylogenetic accuracy both by increasing signal and dispersing noise (32) and should uncover the common underlying signal of the data partitions rather than test the relative strengths of conflicts among them (32). Although it is difficult to distinguish true phylogenetic conflict (due to gene paralogy or lateral transfer) from tree reconstruction artifact, the latter is suggested by lack of statistical support for conflicting topologies (33). Furthermore, conflicts that are only weakly supported by a single partition should not lead to strongly supported conclusions (by themselves) in a four-partition analysis. To test for the possible presence and strength of phylogenetic conflict among these data, each partition (protein) was analyzed separately, as well as in all possible pairwise and three-way combinations (Fig. 3) (34).
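The set of partition combinations analyzed can be enumerated mechanically. The sketch below assumes the single-letter protein codes used in the figures (E, C, A, B) and a hypothetical helper name; it only lists the combinations and does not run any phylogenetic analysis.

```python
from itertools import combinations
from typing import List

PROTEINS = ["E", "C", "A", "B"]  # EF-1α, actin, α-tubulin, β-tubulin


def partition_combinations(names: List[str]) -> List[str]:
    """All non-empty subsets of the partitions, smallest subsets first."""
    return ["".join(combo)
            for size in range(1, len(names) + 1)
            for combo in combinations(names, size)]


combos = partition_combinations(PROTEINS)
# 4 single proteins, 6 pairs, 4 triples, and the full set: 15 analyses.
```

With four partitions this gives 15 analyses in total, of which the first four are the individual proteins and the next six are the pairwise combinations.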

Fig. 3. Interactions among data partitions. Phylogenetic support of individual proteins and all their possible combinations (pairwise, three-way, and all four combined) is shown for the taxonomic groupings listed at the left (group numbering as in Fig. 2) (34, 55). Protein combinations analyzed are indicated above by single letters: E = EF-1α, C = actin, A = α-tubulin, and B = β-tubulin. Accepted groups correspond to nodes found in ≥50% of all shortest trees, with the level of bootstrap support indicated by shaded (<50%), striped (50 to 75%), and solid (75 to 100%) green circles (34). Rejected groups are indicated by striped (<50%) and solid (50 to 65%) pink circles, corresponding to the percent aaBP for the most strongly supported alternative grouping including any member(s) of the group in question (34). All taxa missing data for any protein were deleted from relevant analyses; all analyses excluded Diplomonadida and Parabasalia (9).

Analyses of individual proteins show that no conflicting groupings are supported by >65% aaBP and most by much less (Fig. 3, columns 1 through 4). Therefore, no strongly supported conflicts exist among these data. Pairwise analyses also show considerable evidence of phylogenetic cooperation among all partitions for most questions (Fig. 3, columns 5 through 10); i.e., most pairwise combinations tend to support the same conclusions. This is further evidence of common underlying histories among all partitions.

Cooperation among partitions is especially clear at the super-taxon level (taxa 4, 7, 14, 15, and 18); most individual proteins reject these groups, but most pairwise combinations support them (Fig. 3). For example, the chromalveolate grouping (node 15) is rejected by all three individual proteins testing it, but all pairwise combinations accept the group at least weakly, and the three proteins combined support it strongly (79 to 80% aaBP) (34) (see note added in proof).

Even paired proteins can give markedly greater resolution. For example, actin and β-tubulin individually support 5 of 17 and 7 of 18, respectively, of all nodes tested, but together find moderate to strong support for 15 of 17 nodes (Fig. 3, columns C, B, and CB). EF-1α and actin, both notorious for their inability to unify ciliates [see above and (Fig. 2B) (15, 16)], together not only reconstruct this group but give it a moderately strong 71 to 72% aaBP (Fig. 3, line 12, column EC) (34).

Our data (Fig. 1) suggest that the deep-level phylogeny of eukaryotes may be solvable, despite strong predictions to the contrary (6). Furthermore, the resolving power of these data should continue to improve as more of the constituent sequences are completed (35) and taxon sampling is broadened (36). However, all presently available single-gene phylogenies support only a subset of the major taxa found in the combined data tree, and none supports the same subset (Fig. 2B) (37). This suggests that each gene has its own unique set of strengths and weaknesses as a phylogenetic marker, and it is unlikely that any alone will ever be able to strongly, or perhaps even accurately, resolve all deep branches of a universal tree. Nonetheless, our data suggest that many of the apparent conflicts between individual gene trees (Fig. 2B) are relatively superficial (Fig. 3), and we see little evidence here of fundamentally different phylogenetic histories [as expected if lateral transfer affected these genes (38)].

The single most critical question unanswered by these data is the position of the root of the tree (Fig. 1). Data sets with close archaeal or bacterial homologs are needed to address this most fundamental question.

Note added in proof: Recent evidence suggests that the Porphyra (Bangiophyta) btub-2 used here is from an oomycete contaminant, the highly divergent btub-1 being the true red algal gene (56). Consistent with this, substituting btub-1 for btub-2 increases chromalveolate support to 72% aaBP (Fig. 1). The only other change seen is a weak attraction of the now very long Bangiophyta branch to the nearby Amoebozoa, further destabilizing the Plantae clade (57).

* To whom correspondence should be addressed. E-mail: slb14{at}york.ac.uk

Diplomonads, parabasalids, and entamoebids, the only strong phylogenetic conflicts among these data, were excluded from most analyses (see below). These taxa branch with animals + fungi in tubulin trees but among the deepest protists in EF-1α and actin trees (8).

Sequences were aligned by eye, because length variations are small and rare. Gaps and any adjacent ambiguous alignment were deleted, as were 5′ and 3′ termini (missing in PCR-generated sequences). Monophyly of peripheral clusters was tested with all available sequences by 100 bootstrap replicates of PAM 250–corrected neighbor-joining distances (41). These clusters were then trimmed to favor short terminal branches and taxonomic breadth. The final data set includes 834 parsimony-informative sites and is available from SLB on request. Sequencing of Acrasis EF-1α and Nosema β-tubulin (18) and of Polysphondylium and Acytostelium α- and β-tubulin and Acrasis β-tubulin (42) have been described (GenBank accession numbers AF190771-2 and AF276942-7). All final analyses were derived from PAUP* 4.0b2 with default settings (12) unless otherwise noted. Amino acid parsimony analyses utilized optimal tree searches of 10,000 rounds of random sequence addition and 10,000 bootstrap replicates. Branch lengths were averaged over three optimization strategies [acctran, deltran, and minF (43)]. Maximum likelihood analyses of second codon-position nucleotides utilized the F84 (13) and GTR+I+G (34) substitution models and 100 (Fig. 1) or 500 (34) bootstrap replicates. Although the GTR+I+G model gave the closest fit to these data, both models gave very similar results (34).
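The notion of a parsimony-informative site can be made concrete with a short sketch. It uses the standard definition (at least two character states, each present in at least two taxa); the function names, the toy alignment, and the choice to treat `?`, `-`, and `X` as missing are assumptions for illustration, not a description of how the count of 834 was obtained.

```python
from collections import Counter
from typing import Dict


def is_parsimony_informative(column: str, ignore: str = "?-X") -> bool:
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(ch for ch in column if ch not in ignore)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_informative(alignment: Dict[str, str]) -> int:
    """Number of parsimony-informative columns in an aligned matrix."""
    sequences = list(alignment.values())
    return sum(is_parsimony_informative("".join(col))
               for col in zip(*sequences))


# Toy alignment: columns 2 and 3 are informative, columns 1 and 4 are not.
toy_alignment = {"t1": "AAKA", "t2": "AAKG", "t3": "AGRG", "t4": "AGR?"}
n_informative = count_informative(toy_alignment)
```

Constant columns and columns where a variant state occurs in only one taxon cannot favor one tree over another under parsimony, which is why only informative sites are counted.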

The tree has a retention index of 0.6255 and a rescaled consistency index of 0.3380. One added tree island was found at 5060 steps placing Cyanophora as outgroup to animals + fungi and Bangiophyceae as outgroup to Amoebozoa. Phylogeny of animals is not reliably resolved here due to their complex multigene families for all proteins except EF-1α (8).
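As a reading aid, the two reported indices imply a consistency index if one assumes the usual definition of the rescaled consistency index as the product of the consistency index (CI) and the retention index (RI). The CI value below is derived under that assumption and is not reported in the source.

```latex
\[
\mathrm{RC} = \mathrm{CI} \times \mathrm{RI}
\quad\Longrightarrow\quad
\mathrm{CI} = \frac{\mathrm{RC}}{\mathrm{RI}} = \frac{0.3380}{0.6255} \approx 0.540
\]
```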

Considerable data place diplomonads and parabasalids among the earliest diverging eukaryotes (37). However, their placement in this tree is highly tentative due to strong conflict among these proteins on this issue (8, 9) and the lack of actin data. Therefore, these taxa are appended to the final tree on the basis of separate analyses including all taxa, which place them, with ∼50% aaBP, as shown or as the outgroup to opisthokonts.

Terminal branch lengths were increased proportionally for taxa missing >25% of the data (e.g., the branch length of a taxon missing one complete gene would be multiplied by 4/3). Internal branches, where all subsequently arising taxa have the same percentage of missing data, were likewise increased.
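The proportional correction can be written as a factor of 1/(1 − f), where f is the fraction of data missing; this reproduces the 4/3 example for f = 1/4. The sketch below is an interpretation under that assumption, with hypothetical function names. Whether a taxon missing exactly 25% is corrected is not clear from the text, so the threshold is kept strict (>25%) as stated and exposed as a parameter.

```python
from fractions import Fraction


def scale_factor(missing_fraction):
    """Proportional correction 1 / (1 - f) for a fraction f of missing data."""
    if not 0 <= missing_fraction < 1:
        raise ValueError("missing_fraction must be in [0, 1)")
    return 1 / (1 - missing_fraction)


def scaled_length(length, missing_fraction, threshold=0.25):
    """Lengthen a branch only if more than `threshold` of the data is missing."""
    if missing_fraction > threshold:
        return length * scale_factor(missing_fraction)
    return length


# One of four equal-length genes missing gives the factor quoted in the text.
one_gene_missing = scale_factor(Fraction(1, 4))  # Fraction(4, 3)
```

For instance, a 30-step branch for a taxon missing half of its data would be drawn as 60 steps, whereas a taxon missing 10% would be left unchanged.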

Amino acid sequences were analyzed as described (44) except that optimal tree searches used 100 rounds of random sequence addition and bootstrap analyses used 1000 replicates of one round of random addition with a maximum tree-ceiling of 500. Taxa were excluded from analyses of partitions for which they had missing data.

We thank M. E. Holder of the University of Houston High Performance Computing Center for help with likelihood analyses, Y. van de Peer for useful discussions, and especially J. Felsenstein for the original suggestion. Supported in part by a Natural Sciences and Engineering Research Council of Canada (NSERC) grant 227085 to A.J.R. and by a Medical Research Council of Canada (MRC) grant MT4467 to W.F.D.