Abstract

The first part of the paper briefly overviews the problem of gene and species trees reconciliation with the focus on defining and algorithmic construction of the evolutionary scenario. Basic ideas are discussed for the aspects of mapping definitions, costs of the mapping and evolutionary scenario, imposing time scales on a scenario, incorporating horizontal gene transfers, binarization and reconciliation of polytomous trees, and construction of species trees and scenarios. The review does not intend to cover the vast diversity of literature published on these subjects. Instead, the authors strived to overview the problem of the evolutionary scenario as a central concept in many areas of evolutionary research. The second part provides detailed mathematical proofs for the solutions of two problems: (i) inferring a gene evolution along a species tree accounting for various types of evolutionary events and (ii) trees reconciliation into a single species tree when only gene duplications and losses are allowed. All proposed algorithms have a cubic time complexity and are mathematically proved to find exact solutions. Solving algorithms for problem (ii) can be naturally extended to incorporate horizontal transfers, other evolutionary events, and time scales on the species tree.

1. Reconciliation of Gene and Species Trees: A Brief Overview

This section of the paper does not intend to cover the vast diversity of published literature on the problem of trees reconciliation. Instead, the authors strived to overview the problem of defining algorithmic construction of the evolutionary scenario as a central concept in many areas of evolutionary research. Important definitions are discussed, and essential problems are highlighted. We believe that, despite many approaches to defining the scenario known today, its solid theoretical framework is still to be developed.

1.1. Evolutionary Scenarios and Fields of Their Application

The evolution of the genome, apart from the mutation process, is an entangled complex of individual and concerted evolutions of genes, their regulations, gene content and arrangement on chromosomes, genetic flows between the genome and intracellular organelles, and so forth. Their evolutionary histories often do not coincide with each other and with patterns of speciation giving the rise to a variety of evolutionary events, such as gene duplications, losses, gains, horizontal transfers, chromosome rearrangements, and others. These phenomena play a pivotal role in evolutionary plasticity of the genome, the emergence of genes and gene families with novel functions, maintenance of the molecular machinery of the cell, evolutionary adaptation of the organism, and so forth. As known today, various types of horizontal transfers were the key force to drive the evolution of prokaryotes [1–3], while duplications of genes, partial or entire genomes, and mass gene loss events formed the genotypes of many higher eukaryotes, including higher plants [4–6] and vertebrates [7–10]. The genomic change fixed in generations over time ultimately shapes the biological diversity.

Important information contained in the discrepancies between these evolutions can be extracted and studied with the methods of trees reconciliation. Knowledge of ancestral genomic events provides efficient instruments in a range of fields, like establishing orthology/paralogy relationships between gene families [11–14], functional gene annotations [15–18], reconstruction of ancestral genes and genomes and their dating [19, 20], accurate reconstruction of gene and species trees [18, 21–27], construction of phylogenies based on whole genome data [22, 23], event-based reconstruction of coevolution [28] and its applications in ecology and biogeography [29–31], phylogenetic approaches to predict protein interactions [32], and so forth. A particularly intriguing problem is the coevolution of species, genes, and their regulatory systems, including binding sites, protein and RNA factors, DNA and RNA secondary structures, and RNA triplexes, which is poorly understood even in its statement. Further research in this area will shed more light on understanding the principles of concerted evolution at various levels.

In complex studies of coevolution it is vital to develop reconciliation approaches that account for as many various evolution events as possible. Not only inferring the events per se but also their mutual arrangement in time is important. Such an arrangement is called the evolutionary scenario. An overview of approaches to define and construct the scenario with trees reconciliation is the scope of this section.

In earlier works, scenarios accounted for gene duplications and losses only [33–36], some later—for only transfers and losses [37–41]. Such incomplete scenarios are useful in certain cases, for example, in studies of the Metazoa where transfers are very scarce, with low-copied or functionally nonredundant gene families or under low rates of duplications and losses [42, 43]. Timing the species tree and, particularly, imposing time scales (slices) are used in recent models to incorporate horizontal transfers [44–46]. The problem of defining and constructing the evolutionary scenario in its broad sense is actively studied, although its full definition is by far not yet obtained. Section 2 of this paper contains some original results obtained on these problems.

Approaches to substitute the species or even gene trees with a forest or net (graph or hypergraph) and to identify their areas that cannot be described by a tree are important but remain poorly studied [47].

The accuracy of reconciliation methods depends on the quality of initial phylogenetic data, usually gene trees, and multiple alignments in selected cases. The traditional steps of building gene trees (constructing multiple alignments, the choice and configuration of inference methods, robustness verification, etc.) can be nontrivial, especially for the automated generation of phylogenies on genomic scales. These methods are ever developing and are not discussed here. Some approaches are proposed or overviewed, for example, in [48–52], with extensive further referencing provided therein.

To mention is the group of methods that does not rely on gene trees to construct the scenario. Instead, genetic data in extant species of the given species tree is used to reconstruct the same type of data at internal nodes. In [53, 54] the authors addressed the problem of constructing parsimonious scenarios for individual sets of orthologous genes on a fixed species tree. Duplication events are not considered, and a horizontal transfer is not scored separately from an ab novo gene gain.

An extensive corpus of studies is devoted to the reconstruction of ancestral molecular characters and properties, rather than inferring discrete evolutionary events on the species tree. Such can be ancestral sequences, their lengths, primary and secondary regulatory structures, the tree areas with potential genetic transfers, and so forth [55–59]. Deterministic and probabilistic models (in particular, the Gibbs field approach) to reconstruct ancestral sequences and secondary structures are discussed in [60–62] presents a dedicated web service. These works remain out of the scope, as do studies of the mutation process and various reconciliation applications. The reader is referred to the original cited works for further details.

1.2. Reconciling Gene and Species Trees: The Classic “Embedding ” as the Basis of Other Mappings and Mapping Costs

Earlier scenarios accounted only for duplications and losses. Such is the classic definition of mapping , usually referred to as the “embedding .” In [33, 63] it maps vertices of a gene tree into vertices of a species tree. Namely, each vertex of a gene tree is assigned a vertex of the species tree that corresponds to the last common ancestor of the species containing the leaves descendants of . Mapping explicitly infers duplications and implicitly losses.

Define edges of tree as tubes to distinguish between edges of and . Each root is supplied with an additional root edge (or root tube), which ends in a superroot; that is, the superroot is the only vertex with the single child. Henceforth, all trees are described as directed downwards from the root.

Consider another definition of the “embedding .” Define mapping as a mapping of vertices in the gene tree into vertices or tubes (often both) in species tree that satisfies the conditions: the leaves in map into leaves in having the same species notations; the superroot of maps into the root tube in ; mapping preserves the natural order relation on and ; which is defined on any tree by the branching order downwards from the root (i.e., this relation keeps the succession of lineages). Additional less determinative conditions are formulated in Sections 2.3 and 2.7 ([45, 64]). In this paper, most definitions are provided in Section 2 and the reader is expected to be acquainted with general terminology used throughout the text.

Definition of continued. The total sum of duplications and losses (the “embedding cost”) has the minimal value on among all costs of possible mappings of gene tree into species tree . The embedding cost of mapping is denoted ; the analogous cost of is denoted ; that is, (where is a variable). In other words, and are sums of the amounts of gluings and gaps in mappings and , respectively; these numbers can be weighted according to the costs of corresponding event types (in this case, duplications and losses). Thus, mapping can be defined as a global minimum of the embedding cost functional , where variable runs over all mappings of into . Note that the list of event types and the localization of evolutionary events are defined on the species tree individually for each mapping (refer to definitions in Section 2.4).

Algorithmically, mapping is built by induction from leaves toward the root in linear computing time, and its cost is computed simultaneously [65, 66].

Study [45] describes a similar definition of mapping , and a different construction algorithm is applied from the root toward the leaves. It is a useful definition in terms of its extensibility to scenarios with gene horizontal transfers and gains. The presented algorithm simultaneously computes the mapping and its cost.

In [36] all possible reconciliations of gene tree and species tree are considered, that is, all possible mappings of into . This approach is further developed in [43], where maps each vertex in into a vertex or tube in , thus inferring the speciation (if is a vertex) and duplication (if is a tube) events, respectively. An algorithm described in [43] generates a random reconciliation of and , enumerates all such possible reconciliations, and calculates exactly the minimal number of fixed operations needed to rearrange one reconciliation into the other.

Let only duplications and losses be considered, and let be a binary gene tree with a predefined set of “reliable” edges. To find is a tree with the same set of leaves and containing all clades induced by reliable edges such that minimizes the embedding cost of its mapping into a given binary species tree . Algorithms to solve this problem are described in [35, 67]; in [35] the algorithm is proved to find exactly the optimal gene tree in cubic time, while [67] offers a heuristic solving algorithm. Similarly, in [68] duplications, losses and transfers are accounted for to find a gene tree such that it contains a predefined set of reliable edges (i.e., the induced clades) from and minimizes the embedding cost of any mapping of into a given binary species tree . A heuristic solving algorithm is proposed.

An approach to reconcile gene and species trees based on information about synteny of corresponding genes in the genome is proposed in [69]. An algorithm is described to build a forest of trees that reflect the evolution of pairs of neighboring genes by minimizing the embedding cost of gains and losses of the gene pairs. Computing time of this algorithm has the order , where is the number of gene trees and is their maximal size.

1.3. The Binarization Problem for Fixed Gene and Species Trees

The algorithm described in [70] has a linear time complexity, and, given a polytomous gene tree , binary species tree , and their mapping , searches for a binarization of by first minimizing the total sum of duplications and then the total sum of losses in the obtained set of binarizations.

Study [71] describes a linear time algorithm to binarize the tree against the tree using mapping , provided that only duplications and losses are allowed. A binary resolution of a polytomous is constructed such that the resulting binarized gene tree optimally reconciles with the species tree ; that is, it has the minimal embedding cost compared to other binarizations. Importantly, the algorithm is mathematically proved to find the global minimum of the embedding cost functional ( is a variable). The authors of [71] reference the history of the binarization problem for the case of duplications and losses under fixed and .

Study [25] uses a similar minimization criterion for to binarize many polytomous gene trees against a binary species tree when horizontal gene transfers are allowed, and the variable is an arbitrary mapping (refer to Sections 2.7 and 2.12). In [25] the algorithm is proved to find the globally minimal binarization and possess the complexity determined as follows: if is a maximal degree of polytomy among all vertices in , then the computing time has the order of the product of the total number of vertices in initial trees and and coefficient .

In [70] it is proved that the optimal binarization problem is NP-complete for the case of a polytomous species tree even if a gene tree is binary. However, heuristics is proposed to handle even nonbinary gene trees. In [72] another heuristic algorithm is proposed to solve the same problem, nevertheless requiring a binary gene tree.

The algorithm described in [73] computes all possible binarizations of a polytomous species tree in order to find such that minimizes the embedding cost for an input fixed binary gene tree against the variable . In this search all event types are considered, including transfers; and the variable is an arbitrary mapping. A new condition is imposed: let a vertex in a gene tree be mapped into a vertex in the species tree; then both child clades of contain at least one species from the clade of . The computing time of the algorithm is the product of a polynomial of degree 4 (a function of the number of leaves in the input data) and an exponential functional that depends on the maximal degree of polytomy in the species tree.

Sections 2.12 and 2.13 present an essentially different statement of the binarization problem (refer also to [25]).

1.4. Evolutionary Scenarios with Horizontal Transfers: Coevolution of Genes and Their Regulation Systems on a Species Tree

Accounting for gene horizontal transfers in evolutionary models is vital for understanding the evolution of many life forms, especially prokaryotes [1–3]. It also provides efficient tools to study the evolution of molecular systems, establishing orthology/paralogy relationship between gene families [11–14], and so forth. In [74] the authors give a broad view of the perspectives to reconstruct the Tree of Life within the general framework of genome evolution, the role of gene horizontal transfers, duplications and losses in the emergence of new molecular functions, and evolutionary adaptation.

With only duplication events allowed, for a given set of binary gene trees and a binary species tree , consider any mapping of into . In the approach in [75], for each a duplication event is attempted closer to the root of a species tree but below , where is the parent of (if is the root, is attempted closer to the root). A functional is proposed that depends on and equals the sum (over all vertices in ) of maximal heights of subtrees in all (not only those that reach the leaves) mapped by into a vertex . The desired are mappings that minimize this functional. A linear complexity proved algorithm to find this global minimum is proposed. Historical references to this approach are provided in [75] and in review [76].

Event-based approaches to study coevolution of various elements are discussed in [28], and their applications in ecology and biogeography are discussed in [29–31]. For example, in [77, and unpublished materials] the authors present a model and an effective algorithm to reconstruct coevolution of genes and their regulatory systems (binding sites, protein and RNA factors, DNA and RNA secondary structures, RNA triplexes, etc.) under horizontal transfers and other events allowed on a species tree. A general coevolutionary scenario was constructed based on a universal functional that combines requirements specific for individual scenarios of the co-evolving elements. Evolutionary events inferred in individual scenarios within the general coevolutionary scenario appear to be biologically consistent (coordinated with each other). Inferring coevolutions is an important and complex problem, which we do not almost discuss in this paper.

1.5. Time Slices on the Species Tree as an Approach to Accounting for Horizontal Transfers

When horizontal transfers are included in the model, a gene cannot transfer between two tubes located anywhere on the species tree , a transfer is possible only between the “contemporaries.” To correctly describe transfers, the tree must be partitioned in time slices, for example, by dating its tubes or vertices. An approach to do so is presented in [44], where each tube is associated with a time interval, and a transfer between tubes is allowed if their intervals have a non-empty intersection. A corresponding construction algorithm is described in [44], without the complexity assessment. A very complicated original description of the algorithm does not allow us to provide detailed comments.

Assume that to correctly define a transfer in time is to allow it to occur exactly within one time slice (a set of predefined time slices on the species tree is to be fixed). If the correctness condition is not imposed but transfers are allowed, the fastest algorithm constructs a scenario in time of the order , where and are numbers of leaves in the input gene and species trees [78]. Finding a scenario defined correctly in time is an appealing challenge and requires an intricate imposition of time slices on the tree. Constructing the slices is a difficult problem of its own already at the level of definition. An algorithm with complexity that solves it is proposed in [45], albeit without a proper biological justification.

An approach to construct a time-correct scenario is finely elaborated in [45] and, independently, in [46]. The constructing algorithm in [46] uses a prefixed set of time slices and does not consider (similarly to [44]) the common case of a gene transfer with loss of the donor copy. The authors prove the polynomial time complexity of their algorithm, however not providing an exact assessment of the polynomial degree. In [45] the algorithm accounts for all types of transfers and differs conceptually from those proposed in [44, 46]; it is proved to have the complexity of , where is the number of vertices in a species tree with preimposed time slices and is proved to find the exact global minimum (under certain conditions). The proof is given in [79].

In [80] the following condition (below referred to as the “tofig-condition”) on mapping is formulated. Assume there exists a linear order at vertices of the species tree , for which: for any tube in the inequality is valid, and if for two edges, and , in a gene tree one precedes the other in terms of the natural order on edges, then the upper terminus of the tube that “contains” (i.e., is this same tube or its lower terminus) must “precede” (in the sense of ) the lower terminus of the tube that contains . Under this condition the problem of finding a globally minimal scenario is NP-complete [81, 82]. Strengthening this condition may simplify the situation. For example, let each time slice on consist of tubes equidistant from the root, and let, as mentioned above, horizontal transfers be permitted only within the common slice. Such the condition on implies the tofig-condition if is a width-first linear order. The problem of finding a globally minimal scenario under the above-mentioned strong condition becomes polynomial in time [45].

The notion of the evolutionary scenario, specifically for a pair , is very important in mathematic aspects of the theory of evolution. A realistic scenario is such that accounts for as many different types of gene evolutionary events as possible, including various types of horizontal transfers.

Analogously to mapping , a candidate mapping (scenario) is defined at vertices of the gene tree , with its values being the vertices or tubes (often both) of the species tree , such that keeps the natural orders (the successions of lineages on the trees). Each mapping defines its own set of evolutionary events (exact definitions are provided in Sections 2.3 and 2.7). As in mapping , each event type is assigned a cost. Analogously to the embedding cost of mapping , the cost of a candidate mapping is the sum of event costs defined by , which may be weighted according to the reliability of corresponding vertices and the type of event. The problem is to find the mapping (scenario ) that globally minimizes the total cost under certain constrains, which almost always need to be imposed on its design.

The cost of the pair is the cost of its minimal scenario and is denoted . Therefore, one needs to minimize the functional over all mappings of into to obtain the desired scenario and the value .

An alternative approach is to describe the scenario in terms of a stochastic process on a species tree. Selected relevant approaches are proposed in [25].

An algorithm to construct a scenario for a gene tree and a species tree derived by partitioning in time slices, is described in [25, 45, 79]. The running time of the algorithm has the order of the product of and the number of leaves in . The algorithm, its proof, and all definitions from [45, 79] are reproduced in [83], where the algorithm was extensively tested on novel data. In [84] the same (as the authors perceive) algorithm is applied to different biological data.

The importance of taking into account suboptimal scenarios that can become optimal under slight variations of the costs of event types is demonstrated in [80]. An approach to deal with suboptimal scenarios is proposed in [25], where the authors also examine the case of gene gain using the outgroup approach (refer to the extended event list in Table 1 in [25]).

1.6. Constructing the Supertree

The definitions and algorithms of trees reconciliation and construction of the scenario (mapping) stated above can be applied to another long studied problem: given a set of gene trees, find the tree , for which the total cost of all events for pairs reaches the global minimum ( and are variables; usually , which gives the cost ). The tree is called a supertree. In this statement, the supertree may be imposed certain constraints depending on the initial gene tree data that need to be taken into account when optimizing the total cost functional.

In the classic sense, the supertree is constructed with no constraints by merging input trees using a variety of heuristic methods based on various tree compatibility criteria. In distance approaches, the supertree is found by minimizing the average distance between it and all input trees. Defining a proper distance is therefore of importance. In the framework of trees reconciliation, this problem reduces to the minimization of the functional defined via the total cost of evolutionary events for trees over all . The classic and still commonly used distance was introduced in [33] as the total cost of duplications and losses (and transfers, if allowed).

In some approaches, the supertree construction step is preceded by filtering out leaves, subtrees, or entire gene trees that do not satisfy certain reliability conditions [85]. The discarded elements can later be used to detect areas of “active” evolution on the supertree. We do not discuss such kind of approaches here.

The problem of building the supertree is NP-complete if no constraints are imposed on the desired tree , even when only duplications and losses are allowed [86]. This stimulated the development of heuristic methods and attempts to reformulate the problem itself.

Heuristic Approaches. Among such is the quartet method that consists of two phases. At the first phase, trees are built for all quartets of species; here the choice of the reliability function to assess quartet topologies plays the important role; refer, for example, to [87]. At the second phase, the supertree is built by optimally reconciling the quartet species trees using a heuristics. In different implementations of the second phase, the supertree is constructed either “from root to leaves” [88] or “from leaves to root” [89]. The method produces an unrooted tree.

Rooted supertrees are produced by the triplet method, an analog of the quartet method, where the final tree is obtained by assembling triplet trees also using heuristics, for example, as described in Phase 2 of the supertree building algorithm from [25]; refer to Section 2.10 below.

Other methods use heuristics to maximize the functional of clades matching among two trees (rooted supertrees are produced) [90] or use a matrix representation of multiple trees [91]. A simple method to root species trees is proposed in [25, Suppl. 1].

Out of the scope of this paper remain other approaches to infer a species tree, such as the supermatrix strategies, which are popularly used in many phylogenetic studies of particular groups as well as larger taxa. In the supermatrix design, sets of orthologous genes sampled across the compared species are aligned, concatenated into a “superalignment” (supermatrix) and processed for computing one tree. In so doing, this method combines partially overlapping species samplings in the input orthologous sets to accommodate all species in one tree. Although the supermatrix approach relies on the well-established methodology of inferring gene trees, there exist many pitfalls that limit its application to larger analyses on a genomic scale. Among them are the strict requirement on orthology, missing data in sparse supermatrices, and different modes of evolution exhibited by different supermatrix partitions (often exacerbated by disparities in their size) and even by individual positions in the alignments, which requires the usage of sophisticated evolutionary models and causes inevitable computational burden that may become intractable with larger datasets [92–95].

In this context, fine selection of orthologs has received much attention as a problem of high relevance and arduous both ideologically and computationally. Approaches to this problem diverge into reconciliation-based (e.g., [11–14]) and graph clustering methods (e.g., [96–100]). The authors in [100] proposed a quadratic in time complexity clustering algorithm to construct orthologous protein families based on sequence similarity (and local synteny in certain cases). It was applied to mitochondrial, plastid, and some other (unpublished) genomic data. The obtained clusters well conform with known protein functional annotations, independently constructed orthologous groups, and other protein characteristics. The clustering revealed some lineage-specific proteins. Thus, mitochondria of the vine Vitis vinifera were found to encode proteins also typical for plastids, which implies that a horizontal genetic flow between these organelles had happened in the past [100].

Reformulation of the Problem. The development of novel reconciliation approaches and their effective solving algorithms with low (polynomial) complexity that are mathematically proved to find the global minimum (presumably the correct supertree) holds a good perspective. The algorithm originally developed by the authors [23, 64] introduces a condition that allows to effectively find the global minimum of the total cost functional. The condition constrains the desired supertree to contain only clades from the input gene trees and certain combinations of them. Under this condition and if only duplications and losses are allowed, the algorithm is mathematically proved to find the global minimum of the cost functional in time cubic of the input data size [64]. Solving the same problem for the case of transfers is an important perspective. This approach is based on a different principle compared to other known methods.

1.7. Probabilistic Definitions of the Evolutionary Scenario: Evolution as a Stochastic Process and Coalescent Approaches

The definition of the clade probability as a fraction of trees containing a given clade was introduced in [101]. The authors argue that the correct supertree commonly contains all clades from the initial tree set with the probability >1/3.

The species tree reconstruction under the assumption of numerous transfers is discussed in [102]. Using a probabilistic approach, it is shown that the species phylogeny is tree-like even with a high transfers content, that is, when their number linearly depends on the average number of leaves per tree. Conversely, in [24] it is mathematically proved that the triplet method recovers the correct supertree with high probability only if transfers are not many. Studies [24, 102] well reference this approach.

A stochastic procedure to construct a scenario with all types of events, including transfers, is proposed in [25]. The authors describe an algorithm to compute expectation values of the event numbers in each tube and over all tubes of the species tree. The proposed approach can also be used to determine other characteristics of the process.

In the first subsection below we briefly overview two groups of publications operating with quite sophisticated probabilistic approaches that need to be further discussed in terms of the probability theory. The second subsection is devoted to the coalescent theory.

1.7.1. Evolution as a Stochastic Process

A type of stochastic processes other than in [25] is considered in [103]. Fix a gene tree and a species tree , with tube lengths corresponding to times; paths from the root to each leaf have equal lengths. An oracle is fixed that assigns to each natural number and tube the probability of the outcome “ contains exactly duplications.” Here, a mapping of vertices in into vertices and tubes in is defined under the condition: if for any child of the inequality is valid, then or is a tube having as its lower terminus.

A probability of is the probability of tube to contain exactly duplications multiplied over all in . Recall that the sign stands for the number of the set elements, the cardinality of the set. To find are (i) the highest likelihood among all possible mappings (ii) the mapping with the highest likelihood itself, and (iii) the numbers of duplications in each tube under the mapping . The authors describe a polynomial of degree 5 heuristic algorithm for (i) and exponential complexity algorithms that find exact solutions for each of the three tasks.

The following statement is considered in [104]. Fix and (the root tube is located upwards the root in ) with tube lengths corresponding to times and and being the intensities of duplications and losses, respectively. The intensities are constant across all tubes and are parameters of a linear death-birth process (its formal definition is provided at the end of this subsection).

For each vertex in denote as the set ; that is, contains vertices and tubes from for a variable mapping and a fixed argument . Call mappings and adjacent if the following conditions are valid: and differ at exactly one vertex , and are comparable in (in terms of the natural order on tree ), and there exist no elements from strictly in-between them.

In [104] a tree is defined and generated by the below stochastic process; is then compared with the initial tree . Let us first informally describe the stochastic process for . The root tube of contains the start of gene lineages that descend downwards and bifurcate at each vertex of (the divergence events). In each tube, each gene lineage undergoes duplications or losses with given intensities and . In case of a loss of the lineage terminates, in case of a duplication, it bifurcates into two descendent lineages in this tube. All lineages terminate in leaves, and only then the process ends to generate a tree (inside the tubes) and its natural mapping into ; all lineages terminated before leaves are discarded and not included in . The tree and its natural mapping into tree are generated in any realization of this random process from the root toward leaves in . More precisely, arrange slices by ascending order when the current total amount of lineages changes by 1. Let the root part of the tree be generated at instant . If at that instant the number of lineages in a tube increases by 1, a lineage is chosen equiprobably in the tube and bifurcated; if it decreases by 1, this lineage terminates.

The probability of mapping is the probability that the random process generates a tree isomorphic to the tree through mapping . The probability of tree is the sum of probabilities of all its mappings in ; a conditional probability is defined as . By substituting in the denominator with a sum over a given subset of mappings (defined ), we obtain the definition of a -approximated conditional probability and denote it .

Define a graph, where all vertices are mappings of into and edges connect adjacent vertices. Fix an arbitrary spanning tree in the graph that is rooted by mapping ; and let be a connected subgraph with vertices in . In [104] the authors prove the following: for all mappings from , the probability is computed with the time and memory of and the -approximation —with the time and memory of .

Experiments with biological data were performed to obtain realistic values of intensities and of duplications and losses.

A -probability is the sum of conditional probabilities of all mappings from , which are separated from the root by maximum edges; such mappings are called -mappings. Computer simulations showed that, (i) with the increase of (from 0), the -probability soon reaches the plateau and (ii) for each mapping before the plateau, the value approximates with high accuracy if includes all mappings before the plateau.

An algorithm realizing the approach of [104] was developed and applied to biological data in [105]. Earlier related results are in [104, 105].

Probabilistic modeling of gene evolution can also be applied to model sequence divergence, as described in [106] and, with more detail, in [107]. A model and an algorithm are proposed in [107] to simultaneously infer gene trees, the species tree, and expectations of duplications and losses in each tube of the species tree, given a set of multiple alignments. Further relevant references are provided in [107].

A Formal Description of the above Described Process. Let be a linear death-birth process applied to the tree . The process argument is time taking on a value from 0 to , where is the path length between the root and a leaf. At each vertex define time as the length of the path from the vertex to the root; tube ( closer to the root) “contains the instant” if . The value of is a set of pairs: a tube possessing instant and the number of gene lineages in the tube at instant . The definition of is as follows: is a set consisting of the single pair ; that is, ; let for a nonroot tube hold ; then, by induction from root to leaves assume that equals , where is the parent tube for ; determine the change of in a small time interval of the argument such that contains . For each tube possessing instant , define conditional (transition) probabilities of the number of gene lineages at the instant if at the previous value is known:
where is duplications intensity, and
where is losses intensity,

The end of process definition.

1.7.2. Coalescent Approaches

In this group of studies, modeling the evolution of genes along a species tree includes a novel approach and an evolutionary event of novel type, the incomplete lineage sorting (refer to [108, 109] for the theory and references provided therein). Below we go with some detail into this important concept, which is grounded on the mathematical theory of the reverse time. On trees, the “direct time” refers to the time directed from the root to the leaves, and the “reverse time” reverses this direction.

The problem of reversing time in stochastic processes was first visited by Kolmogorov [110]. Kingman [111] had found that the probability distribution on phylogenies in large populations is described by a special type of random processes named coalescence and analyzed them in reverse time using the earlier model of population evolution proposed by Wright [112] and Fisher [113]. These works laid the foundation of the coalescence theory, and Kingman gave it further development and formulated it for continuous time.

The Central Idea of the Model in Short. Consider the sets of “parents” and “children,” and , at th and ()th generations, each consisting of elements. Assume that a multivalued mapping of into is surjective and satisfies the condition: the values of any two different elements from are disjoint subsets of . Such the mapping is an inverse of the single-valued mapping of into (informally, children are mapped into their parent). There is exactly different mappings , which are equiprobable, and thus each has the probability . It is also assumed that for all generations all are independent random maps.

This simple description is equivalent to conventional definitions of the Wright-Fisher-Kingman model, which we describe below for the comparison.

The jth individual (“parent”) from produces individuals (“children”) in generation and dies; are supposed to be mutually independent (over ) and follow the Poisson distribution:
Let a population contain individuals at each generation ; that is, the condition must be imposed to fix the population size over generations. The joint distribution of is multinomial:
where .

The map can be interpreted as the random choice of a parent from by each individual from ; this choice is equiprobable. The latter formula defines the probability of the event “the first parent has children, the next parent has children, and so on down to ”.

The evolution in reverse time is a transition from to and deeper toward the root. The probability of any two individuals from having different parents is ; and having different parents in preceding generations (down to ) is . The probability of fixed different individuals from having different parents in one preceding generation is
and having different parents in preceding generations (down to ) is . Consider a limit of with variable and . The lifetime of one generation is assumed to be ; that is, in time , one observes generations, . The probability of fixed individuals having different parents (in the limit under ) over fixed time (the lifetime of generations) is

A simple interpretation of the last formula: individuals can form pairs, the probability that any pair of individuals over time does not share a common parent is . In random time (with the parameter ) a random pair of individuals is chosen and assigned a parent. Time is defined by the random variable distributed exponentially as . The choice of the pair is equiprobable because the probability of choosing a pair from possible pairs is . An unpaired individual is a parent of itself. The obtained parents are further paired with each other analogously until no further pairing is possible.

This process can be described as building a phylogenetic tree for given individuals (leaves). In a next (inductive) tree level in the direction root ward, a pair is chosen from current parent individuals and coupled under a new common node to form two new edges with the same length equal to the value of random variable distributed exponentially with the parameter (at the start of induction ):
An unpaired individual is projected on the next level and forms a new vertex and a new edge of length connecting the two vertices. Thus, an unpaired individual is parental to itself.

This process is called the coalescence and ends with building a rooted tree with leaves.

Other coalescences are considered in [109, 114]. Namely, fix individuals at a time instant, with mutants and wild type, . Assume that all mutants evolve from one parent that acquired a single mutation and all its descendants are mutants. Denote , the extant population of mutants and by , the population of extant wild types. The genealogy of bearing the mutation is a subtree in the genealogy of the whole population, the union of and . The coalescence process is used to find a phylogenetic tree such that lineages of coalesce earlier than any lineage from forms a common parent with any lineage from . A coalescence that satisfies this constraint is called conditional [114]. Algorithmically, the coalescence tree building process is running multiple times until the described tree containing the clade is built.

Another constraint imposed on the tree building process is studied in detail in [109]. Namely, consider a species tree with lengths (times) given for all tubes such that all paths from any node to the leaves are equal, thus defining the age of the node. Conditional coalescence can be applied to build a gene tree along with its mapping into , that is, the evolution of the gene inside the species tree [109]. At the start of induction, each gene leaf is assigned to its corresponding species leaf in the tree . In reverse time, in the resulting tree any two gene leaves existing in two different species tubes form the common gene parent of at least the age of the common parent of the corresponding species. Thus, gene lineages may coalesce much later than their containing species. Such an event is called the incomplete lineage sorting.

The described process is applied to the case when the mutant is replaced with a duplicated copy of a gene that acquired a mutation after the duplication [108] had occurred. If a part of a population undergoes genetic change, it may result in the formation of subspecies. After the speciation event, the change is usually fixed in this subspecies. The duplicated copy of a gene survives, in contrast with the models like mapping that operate with species as discrete units.

Study [108] also introduces the interim locus tree concept based on conditional coalescence. A gene tree is mapped into the interim locus tree, which then maps into the species tree. The species tree evolves in direct time, from root to leaves. However certain ideas in the description of this approach remain hard to understand.

2. Constructing the Evolutionary Scenario and the Supertree: Algorithms and Proofs

This section (Sections 2.1–2.13) of the article describes the original solutions and corresponding mathematical proofs proposed by K. Y. Gorbunov and V. A. Lyubetsky for the two problems in the field of trees reconciliation: inferring gene evolution along a species tree and trees reconciliation into a single tree (including the case of polytomous trees). These developments apply to a diverse and important subject of the evolution of species, genes, and their regulatory systems considered in concert or separately.

2.1. Statement of Two Problems

Studies [23, 25, 45, 64, 79] tackle two important and sophisticated problems in bioinformatics. The obtained results are partially reviewed in Section 1 of the paper, which also provides an extended biological background and relevant references.

The first problem is to reconstruct a gene evolution along a species tree or, in other words, to construct a mapping of a gene tree into a species tree and to build the scenario. The second problem is to reconcile a set of gene trees into one common species tree. A specific facet of the second problem is to build a supertree (by globally minimizing a suitable functional commonly referred to as the “cost”) for the given set of trees. This problem is extended to the hard case of polytomous data, especially polytomous input trees.

In the above-mentioned works [23, 25, 45, 64, 79] only concise formulations are provided, while in this section we give mathematical statements and proofs to describe the two problems on the case when only gene duplication, loss, and divergence during speciation are the considered evolutionary events. Following on, we describe in detail the extension of the developed algorithms to incorporate other types of gene evolution events and/or the case of polytomous gene trees.

The first problem is solved in polynomial (often linear, at maximum cubic) time even for the case of incorporating time slices and horizontal gene transfers. In Section 2.7 it is proved that the corresponding original algorithm of cubic time complexity finds exactly the global minimum; that is, the model is exactly solvable.

In its traditional statement, the second problem cannot be solved algorithmically in polynomial time, as it is proved to be NP-complete. Known exponential solutions (based on various enumerators) are computationally too intensive, and do not guarantee that the optimal solution (the global minimum of a functional) is found if heuristics are applied to stop the search. Moreover, the accuracy of approximating the global minimum by a heuristic solution is not clear at all.

Complete proofs are first given for the case of no time slices and gene transfers. The discussion of the second problem follows next. A solving algorithm cubic of the initial data size is suggested that finds the exact conditional (refer to Section 2.5) global minimum under no gene transfers; that is, in this case the model is also exactly solvable. However, for the case of transfers the algorithm is not mathematically proved to find the exact conditional global minimum, which remains an important open problem. The heuristic solution for this case and its usage are described in [23, 25, 45, 64, 79].

We end this section with giving a solid mathematical background for the second problem for a fixed set of polytomous rooted gene trees. This problem is also discussed in [25].

2.2. Auxiliary Definitions

Let a gene tree and a species tree be given. The trees are rooted and binary, and oriented downwards from the roots. Recall that edges of the tree are referred to as tubes to distinguish between the edges of and . Each root is supplied with an additional root edge (or root tube), which initiates in a superroot and ends in the root; that is, the superroot is the only vertex inducing the single child. Each leaf is labeled with a species name. Species names in are unique; species names in may duplicate if it contains several genes from the same species (paralogous genes). Species names in are a subset of those in . A subtree is a part of a tree that consists of a vertex, an edge entering the vertex from above (the subtree root edge), and all vertices and edges descending downwards. A clade of a subtree is a set of species names present in all its leaves; a clade of a vertex is the clade of its subtree. For a clade , the corresponding tree is referred to as a tree over V. A paralogous subtree (with respect to a species) in is such a maximal subtree that has all leaves marked with one species (i.e., its clade is a singleton; the paralogs are in-paralogs for this species). Pruning of a subtree from tree is a deletion of all edges and vertices in belonging to followed by merging of the two edges, incoming in and outcoming from the upper terminus of the root edge in . A child of a vertex is another vertex located directly downwards, that is, at a distance of one edge. Remember that and mean the natural order on any tree as defined in Section 1.2. The natural order relation is defined analogously on a set of edges (tubes), a set of vertices, or a united set of edges and vertices. The terms “lower” and “upper” refer to the natural order of the tree branching downwards from the root.

Let be the lower terminus of edge , and let be its upper terminus.

2.3. Definition of Mapping with Duplications and Losses Only: Reconciling Gene and Species Trees

A mapping of a gene tree into a species tree is an assignment of each vertex in to a vertex or tube in , the superroot is mapped into the root tube, and each leaf is mapped into a leaf with the same species name. Two conditions are imposed on : if a vertex is mapped into a tube, its child is mapped into the same tube or downwards (lower); if is a vertex in , then for children and of , the values and are in the two different descendent (lower) subtrees of in .

2.4. Definitions of Gene Duplication and Loss and Their Localization on the Species Tree

Let mapping be fixed. In , a duplication is a nonsuperroot vertex for which is a tube, a divergence is a nonleaf vertex for which is a vertex, and a loss is a pair of edge in and vertex in such that, for the upper terminus and the lower terminus of , we observe . If the clade of a child of contains no species from , the loss is called implicit (as it is induced by species in but not in ). Otherwise, the loss is called explicit. A duplication is located in the corresponding tube in , a divergence in the corresponding vertex in , and a loss in vertex .

Each event type (duplication, loss, divergence, etc.) is assigned a nonnegative cost value. A cost of mapping of into is the sum of event costs inferred in this mapping. A cost of mapping of a set of gene trees into a species tree is the total cost of mappings of into . Denote these costs and , respectively. The variables and are often implied but not written explicitly.

A mapping with the minimal cost is called canonic and designated , [33, 63]. A linear algorithm to construct it is described in [65, 66]; more details can be found in [20].

Denote a set of all species names in all given gene trees .

2.5. Formulating the Problems of Reconciling Two (Gene and Species) and Many (Gene) Trees

During the reconciliation of two trees, for given gene and species trees, a mapping is sought for such that it globally minimizes the functional over the variable .

During the reconciliation of many trees, for a given set of gene trees, a set of mappings and a tree are sought for such that they globally minimize the functional over the variables and . This minimization is done under the ad hoc condition: each must contain only clades belonging to a predefined set of subsets of set ; all clades from are by default already contained in .

Traditionally, the second problem requires the unconditional (absolute) minimization. We refer to the introduced reformulation as to the parametric (over the parameter ) or conditional minimization (optimization).

2.6. The First Problem under Gene Duplications and Losses Only: Reconciling Gene and Species Trees

If is not the superroot vertex in tree , denote the last common ancestor in of a clade defined in by a subtree with the root vertex . A second definition of canonic mapping slightly differs from the definition provided in Section 1.2 as follows. Let , where is the root tube. If for both children of vertex holds the inequality , then ; otherwise is a tube incoming to from the upwards. Informally, may be visualized as located “inside the tube.” Hereafter, only the second definition of canonic mapping is used. Analogously, a set of mappings is canonic if each (of into ) is canonic. From the remark to Lemma 2 it follows that the second definition of and its definition based on the global cost minimization are equivalent. The second definition is given in [33, 63].

Lemma 1. If mapping is not canonic , then for each vertex the inequality is valid, and, at least for one , .

Proof. Clade contains clade , that is proved with induction from leaves to . The first inequality follows from the statement above and the observation: if is a tube, then is not its lower terminus, as the terminus already contains a descendant of .By definition, cannot map two comparable vertices to one. The condition implies the last statement of the lemma.

Lemma 2. For any mapping different from , the amount of duplications for is not less than for , and the amount of losses is strictly greater for than for .

Proof. Consider a duplication for ; that is, , where is a tube. Then cannot be a vertex , as then, by definition of mapping, one of the children of must map in a descendent subtree of not containing . It is impossible, as the clade does not intersect with the clade of the descendent subtree (it is further referred to as the bifurcation effect). According to Lemma 1, , therefore is a tube. Consequently, a duplication for remains a duplication for . Note that a divergence for may become a duplication for .First prove that the amount of losses is not less for than for . Consider a loss for . If , it remains a loss for , because, according to Lemma 1, . The equality is false due to the bifurcation effect.Next, due to obtain that is comparable with . If , the loss () corresponds to at least two losses, () and (), in for , where and both . Indeed, on any path from downwards, an edge will induce a loss in (a divergence cannot occur on the path due to the bifurcation effect). If is another loss for , it corresponds to two losses in differing by or , given that is incomparable with . Thus, there exists a multivalued injective mapping that maps each loss in to one or two losses in , with nonintersecting images. Since for a fixed vertex the property of being an explicit or implicit loss in depends on the tree only, and losses for are of the same type (explicit or implicit) as for .The Last Statement of the Lemma. By the condition and Lemma 1, there exists a vertex in , for which . The two cases are possible: (i) and are tubes, (ii) is a tube, and is a vertex. Indeed, for a vertex a contradiction arises according to the bifurcation effect.Case (i). Let be an arbitrary vertex, for which . Consider two nonoverlapping paths from to the leaves. On both paths there occur edges and inducing for losses and in (the bifurcation effect). As , the paths from and to the root contain either none or one coincident loss in for . Consequently, the losses and either are not contained in the mapping image or constitute the image of one coincident loss. In both alternatives, the amount of losses is greater for than for . Case (ii). Consider an arbitrary path from to a leaf. According to the bifurcation effect, it contains an edge such that is a loss for . This loss is not contained in the mapping image , as there exists no edge on the path from to the root such that is a loss for .

Remark 3. Let and be two different mappings. If for any vertex holds the inequality , then by substituting to in Lemma 2 we prove that the amounts of duplications for is not less than for , and the amount of losses is greater for than for . An analogous statement is proved in [115] for vertex-to-vertex mapping functions.

Henceforth, assume that the cost of a divergence is less than the cost of a duplication; this condition is likely to be biologically justified. Then, by Lemma 2, a canonic mapping is a solution of the first problem, that is, the two definitions of coincide. Further, if in a set a mapping is not canonic, then its replacement with a canonic mapping will reduce the total cost. Thus, in the second problem the only true variable is the desired species tree .

Lemmas 1-2 solve the first problem only for the case when the gene duplication, loss, and divergence during speciation are considered. Lemmas 4–7 prove certain properties of and will be used in the proofs of Theorems 8-9.

Lemma 4. If a gene tree is obtained from a species tree by pruning some subtrees from , then for a canonic mapping of into duplications and explicit losses are absent, and each pruned subtree (with the root tube ) induces an implicit loss in . Conversely, if for a canonic there are no duplications, then is obtained from by pruning some subtrees.

Proof. Prove the lack of duplications with induction on the amount of pruned subtrees. At the start of induction, , and only a divergence event is possible.An induction step from to , where is the number of pruned subtrees. If in it is true that is a vertex and a vertex is not pruned, then both clades of its children in and are subsets of the clades of corresponding children of vertex . These children in still map in strictly below . Therefore, in also , vertices in map into vertices, and a duplication does not occur.Prove the absence of explicit losses by contradiction. An induction step. Let a vertex contain an explicit loss after pruning a ()th subtree . Then in tree both clades of the children of contain species from . Consequently, there exists a vertex , for which , as such existed in and was not pruned. Thus, edge does not exist.A vertex of tree , a former image of the upper terminus of the root edge of a pruned subtree, contains an implicit loss in induced by a new edge in the tree formed after merging of two initial edges.Let us prove that, for any vertex in , a set , where is a subtree rooted in , defines a tree obtained from by pruning certain subtrees. If is a root, this statement is obtained. Prove it with induction. If is a leaf, the statement is obvious. An induction step. Assume a vertex , for which . Then the sought subtrees set is the union of the corresponding sets for children and of vertex . Assume there is none such . Then the images of members of belong to one child subtree of vertex (put it the child ); otherwise will contain the last common ancestor of members of . The sought set of subtrees consists of a subtree rooted in and the set of subtrees for .

Lemma 5. In a canonic of into , each leaf tube terminating in species contains the number of duplications equal to the number of nonleaf vertices in a paralogous subtrees for in the gene tree .

Proof. Denote the number of nonleaf vertices in Lemma 5 by . Any internal vertex of a paralogous subtrees induces a duplication according to the definition of . And, conversely, such a duplication corresponds to a vertex in that is contained in a paralogous subtree for .

Lemma 6. Fix a gene tree over a subset of . Let species trees and be both defined over , each containing a certain subtree . Then the two canonic mappings of into and into produce the same set of events in ; that is, the set of events in a subtree does not depend on the subtree’s complement (the rest of the tree).

Proof. Let be a clade of the subtree . Vertices mapped in coincide in both mappings, as their clades belong to ; the image of such vertices coincides in both mappings, as it depends only on . By definition of the duplication, the set of duplications in coincides in both mappings. Let be a loss in one mapping , where is a vertex in . Then in another mapping the image of edge remains constant, and the image of also remains constant (if belongs to ) or remains external to (if does not belong to ) above the image of and, therefore, above . In both cases, is a loss also in another mapping . Thus, the set of losses also coincides between two mappings .

Lemma 7. Fix a gene tree over a subset of . Let be a subset of , and a species tree over contains a subtree over . Let a tree be derived from by substituting the subtree with a subtree over . Then the canonic mappings of into and into produce the same set of events in the complements to the subtrees and ; that is, the set of events in a complement to a subtree does not depend on this subtree.

Proof. By definition of mapping , vertices mapped outside are the same for , as their clades do not belong to , or, equivalently, their LCA images are not contained in . Each such vertex has the same -image. Indeed, the values of coincide on and ; that is, if on one of the the is a tube, it is a tube on the other. By definition of a duplication, the set of duplications outside coincides between the two mappings. Let be a loss in one mapping, where does not belong to . Then in the other mapping the image of does not change, and that of either does not change (if not belong to ) or remains in (if belongs to ) and thus is below . In both cases, is a loss also in the other mapping. Consequently, the set of losses also coincides between the two mappings.

2.7. The First Problem under Gene Duplications, Losses, and Horizontal Transfers with Imposed Time Slices: An Algorithm to Reconcile Gene and Species Trees (Building an Evolutionary Scenario)

The generalization of mapping to incorporate gene transfers has long been a daunting task. Here we describe an original approach to solve it.

Let the species tree impose certain time slices; refer to Sections 1.4-1.5; the slices are ranked from the root to leaves. The slices must satisfy the single condition: if , then the rank of is not less than the rank of . For example, a th slice contains all tubes distanced by the amount of tubes from the root; in [25] the slices are constructed with an additional condition: all leaf tubes belong to one slice. The latter condition is inessential in further definitions and is accepted without discussion. Denote for tubes and if and , belong to the same time slice.

With horizontal transfers, we formulate a similar (refer to Sections 2.3, 2.6) but inductive definition of mapping of a gene tree into a species tree and its cost [64]. Simultaneously with , an additional tree is defined as derived from by inserting new vertices with a single child. The number of new vertices on an edge defines the number of transfers: if is even, a gene (more precisely, edge in ; see below) underwent transfers without retention of the gene donor copy, and if is odd, a single transfer with and transfers without retention.

Let be any edge in ,and let be any tube in . The definition of and is based on an important auxiliary definition of the inner tree and its cost for any pair. All pairs of the form are partially ordered: a pair is lower than if or and the rank of tube is greater than that of . Pairs are visited from leaves to the root in the linear order consistent with the described partial order. Remember that any vertex is identified by its incoming edge.

2.7.1. Defining the Inner Tree for the Pair

The start of induction. Let and be any leaf edge and any leaf tube, respectively, and let be a tube with the species of gene in its lower terminus. If , the inner tree contains the pair and its single child, the pair ; this corresponds to a transfer without retention of the donor copy from into . If , the inner tree consists of the single pair . The cost of this tree is the cost of a transfer without retention if and is zero otherwise (for more details on transfers refer to Sections 1.4-1.5) [23, 25, 45, 64, 79].

Thus, the inner tree is a marked tree; the mark of a vertex has the form .

2.7.2. An Induction Step

Let and be a nonleaf edge and tube, respectively. Then the inner tree and its cost for the pair are defined as follows depending on the sequential choices listed below. Namely, for any , the outcome is selected according to rules 1–6 below, with some of the rules describing a choice. In square brackets is the description of applicability. Otherwise said, a set of inner trees is defined, with each inner tree describing an alternative evolution of gene inside species .

(1) [Tube has the single child ]. The inner tree consists of the pair with the single child that roots the already known inner tree for . This tree has the cost equal to that of . Descriptively, lineage enters the next tube.

(2) [Tube has two children, and ]. The inner tree consists of the pair with the single child that roots the inner tree for or the child that roots the inner tree for (only one case must be chosen). The cost of this tree is the cost of the chosen plus the cost of a loss (explicit if the other child possesses at least one leaf from and implicit otherwise). Descriptively, lineage survives only in one of the two tubes.

(3) [Edge has children and ]. The inner tree consists of the pair with two children, and , which root the inner trees for pairs and . Its cost is the sum of costs of trees and and a duplication. Descriptively, lineage is duplicated in .

(4) [Edge has children and ; tube has children and ]. The inner tree consists of the pair with two children, and , which root the inner trees for pairs and . Its cost is the sum of costs of trees , , and a divergence. In the alternative choice, and swap. Descriptively, lineage diverges in .

(5) [Edge has children and ]. The inner tree consists of the pair with two children, and , which root the trees for pairs and , where . Its cost is the sum of costs of the trees for , , and a transfer with retention. In the alternative choice, and swap. Descriptively, lineage duplicates in with a subsequent transfer into and retention of the donor copy in .

In rule 6 the definition of is used in the same sense.

(6) In this rule, descriptively, lineage duplicates in with subsequent transfers into and losses of the donor copy in .

(6.1) [Tube has the single child ]. The inner tree consists of the pair with the single child , which also produced the single child that roots the tree for . The cost of this tree is the sum of costs of and a transfer without retention. Descriptively, lineage enters from into the next tube .

(6.2) [Tube has two children, and ]. The inner tree consists of the pair with the single child , which also produces the single child that roots the tree for . The cost of the tree is the sum of costs: , a transfer without retention, and a loss in (explicit if possesses at least one leaf from and implicit otherwise). The alternative is the choice for . Descriptively, lineage survives only in one of the two tubes.

(6.3) [Edge has children and ]. The inner tree consists of the pair with the single child , which produces two children, and , which root the trees for and . The cost of the tree is the sum of costs of , , a transfer without retention, and a duplication in . Descriptively, lineage duplicates in .

(6.4) [Edge has children and ; tube has children and ]. The inner tree consists of the pair with the single child that produces two children, and , which root the trees for and . The cost of this tree is the sum of costs: , , a transfer without retention, and a divergence. In the alternative choice, and swap. Descriptively, lineage transfers in and then diverges in the lower terminus of .

(6.5) [Edge has children and ]. The inner tree consists of the pair with the single child that produces two children, and , which root the trees for and , where (tube differs from tubes and ). The cost of the tree is the sum of costs of , , and transfers with and without retention. Descriptively, lineage transfers in , duplicates in with a subsequent transfer into and retention of the donor copy in . The end of the inner tree definition.

Remember the notation: subscript and superscript indices of “+” designate lower and upper termini, respectively, or edges and tubes; and are the root edges in trees and .

The inner tree for the pair is used to construct a candidate mapping and simultaneously a candidate tree , which vertices are mapped into vertices and tubes of tree . Namely, when running the vertices of an inner tree for the pair from its leaves upwards to the root consider the following. Let and be children of edge , and let , be children of tube . In square brackets is the description of applicability followed by the rule formulation. Each pair marks the corresponding vertex in tree :

(0) ;

(1) [leaf vertex ] ;

(2) [vertex has a child of the form ] . If the other child has the form , a new vertex (with the single child) is inserted on edge in current and . If the edge already received a number of single-child vertices, a new single-child vertex is inserted in the edge upwards of the already received;

(3) [ has the children and ] ;

(4) [ has the single child ]. Insert two vertices and on edge in current (each with the single child; is higher than ) and , .

The set of candidate mappings of into is obtained. Candidate partial mappings for any pair are obtained analogously, as well as candidate partial trees . The end of the candidate mapping definition.

A scenario (mapping) is a candidate mapping that minimizes the total cost of its evolutionary events.

The role of the inner tree for is to describe the evolution of a gene described by a tree inside the species described by a tree ; if a pair is a vertex of the inner tree then edge evolves inside tube at least along its certain segment.

An algorithm to build the scenario trivially repeats the same induction that was used to define the inner tree: for every pair , the choice will minimize the cost over all possible choices. The same induction is used to build the mapping that coincides with canonic when transfers are not considered.

To account for gene gain events, we introduce an auxiliary outgroup, a tube connecting the root of with an auxiliary outgroup species . Introducing time slices generates tubes on the outgroup tube with single children, which we also denote . Gene lineage that evolves into the outgroup tube and later transfers back into the initial species tree is considered as gained. The start of induction is modified as follows: for , the cost of a transfer without retention is replaced with a fixed gain cost. Induction steps are also modified. In rules 2-3, the costs of loss and duplication are zeroed for and , respectively. Rule 3′ is added: for a pair , the inner tree consists of with two children, and , which root the inner trees for pairs and , where and are children of , and is the upper outgroup tube. The cost of this tree is the sum of costs for and . In the alternative choice, and swap. In rule 4, the cost of a divergence is zero for . A condition is added in rules 5, 6.1, 6.2, 6.3, 6.4, and 6.5: tubes , are not in the outgroup; for , the cost of a transfer with retention (rule 5) or without retention (rule 6) is replaced by the gain cost.

In [25] we describe an even more extended list of evolutionary events. The nontrivial definitions and algorithm above were proposed and thoroughly tested in [45]. In [79] the complexity of the algorithm was mathematically proved to be cubic with respect to the number of vertices in the species tree that contains time slices. In [45] it was mathematically proved that the algorithm finds the minimal mapping and its cost under the presence of horizontal transfers.

The first problem is solved for the general case.

2.8. The Second Problem: Phase 1 of the Supertree Building Algorithm under Gene Duplications and Losses Only

Hereafter, all mappings are canonic . Only duplication, loss, and divergence events are considered.

Consider a set of gene trees with a set of species called . To find is a species tree over , for which the total cost of individual tree mappings is globally minimal. It is an NP-complete problem. To overcome this limitation, we reformulate the problem of unconstrained optimization into a biologically justified constrained (conditional) optimization problem. Constrain the solution space to contain only species trees satisfying the condition: all clades of belong to a predefined set , which includes at least all clades of input gene trees. Thus, must also satisfy this condition. The parameter is nontrivial and is introduced to overcome the NP-complete nature of the problem. A “true” species tree may not exist in this solution space, depending on the degree of consistency of the input set of clades.

The proposed original algorithm of solving the second problem consists of two phases. An exact solution is obtained during Phase 1, provided that the conditional optimization problem is solved under a certain condition.

If the condition is not valid, a follow-up heuristic procedure implemented in Phase 2 can be invoked, which outcome depends on the data generated during Phase 1. As with real data the existence of the unconstrained solution in the solution space for a fixed is usually unknown, one can either empirically expand the set or take the heuristic solution obtained during Phase 2. In computer simulations the latter strategy produced better results (data not shown).

Description of Phase 1. Standard approaches are used to define algorithmic relations over sets from : the “inclusion of one set into another,” “intersection of two sets is empty,” and “cardinality of a set.” Also, the algorithmic relation is defined between vertices of (separately for each ) and their clades from . Different vertices (even within one tree) may correspond to the same clade; the set may contain sets that do not correspond to any clade in the input gene trees.

For each set from the set of all its partitions is defined. A partition is a pair of nonempty nonintersecting subsets , of set that belong to and their union equals ; partitions are easily calculated by verifying the condition . Sets from that can be so partitioned down to singletons are defined as basic; all singletons are also defined as basic. The set may contain nonbasic sets. Thus, an initial may be nonbasic, which invokes Phase 2 of the algorithm. By induction, we enumerate all basic sets according to the increasing of their cardinality. For each basic set, Phase 1 constructs a tree over , called a basic tree, and computes its cost. In the algorithm implementation, the construction of basic trees and computing their costs are naturally combined. For any singleton from , tree contains the single leaf (the root) and the root tube; its cost is zero if there are no paralogous trees for and is the cost of one duplication multiplied by otherwise (refer to Lemma 5).

2.8.1. Definition of Basic Trees () and Their Costs: The Induction Step

Fix nonsingleton basic set from and enumerate all its partitions into basic sets and with lesser cardinality.

For each partition, compute a new cost as follows. Denote the clade of a vertex . Let and be children of ; if is a superroot, then . Run each in all and compute the following numbers , , , and .

The number of vertices in all , for which and ; (or otherwise: and (the sign stands for “a subset”)); the number of vertices in all , for which is a subset of and at least one of the sets or has non-empty intersection both with and with .

Select gene trees for which (i) the root clade intersects with both sets and and (ii) the root clade intersects with one of the sets and not with the other.

Compute the number of edges in all satisfying (i) and the new condition (iii): is a subset of or , and either is the superroot, or for the child of , the set is a subset neither of nor of . Also compute the number of edges in all satisfying (ii) and (iii).

Define a new cost
where is the cost of a divergence, is the cost a duplication, is the cost of an explicit loss, and is the cost of an implicit loss.

Assume that is the minimal cost among for all partitions of . The tree is obtained by merging trees and under the join root, where is one of the pairs satisfying the minimal cost requirement. The cost of is defined as .

Phase 1 outputs a set of basic trees for each basic set . The end of Phase 1.

2.9. Justification of Phase 1

Let be an arbitrary species tree over that includes a subtree . Denote the total cost of events in in canonic mappings of all gene trees in . The cost differs from the total cost as it accounts only for the events in ; of course, if , the costs are equal. If any tree over is considered that contains as a subtree, the cost will remain the same as for according to Lemma 6. Thus, the cost is a function of the tree and does not depend on its comprising tree .

Evidently, if the second conditional problem is solvable, then is a basic set, and the tree ) is the solution according to Theorem 8.

Theorem 8. A basic tree globally minimizes the functional in the conditional problem for if the problem is solvable. The algorithm constructs ) in time , where is the number of input trees .

Proof. Obviously, the solution exists if and only if is a basic set. The time complexity is proved in [64].By induction, enumerate basic sets according to the increasing of their cardinality. For a singleton set, the statement of Theorem 8 follows from Lemma 5. Let be a nonsingleton set. Prove that, for each partition of into and , the computed value equals the sum of event costs in a tree , where is a result of merging trees ) and ) under the common root (as mentioned above, the value depends on only). Denote the common root, and —the tube entering the root (the root tube). There are three groups of considered events: (i) events in ), (ii) events in ), and (iii) events occurring in or in . By inductive assumption, the total event cost of groups (i) and (ii) is .Examine the total event cost of group (iii). From definitions of mapping and the events, it easily follows that(1) (a divergence event) if and only if the condition on corresponding to the number in the algorithm description is satisfied;(2) (a duplication event) if and only if the condition on corresponding to the number in the algorithm description is satisfied;(3) pair is a loss if and only if condition (iii) in the algorithm description is satisfied; the loss is explicit if condition (i) on is satisfied and implicit if condition (ii) is satisfied.Thus, the algorithm finds the numbers of duplications in , divergences in , explicit and implicit losses in , and their total cost. Consequently, the value is computed correctly.Let a certain tree be the global minimum of the functional if all its clades belong to the set . The root bifurcation corresponds to a partition of into two basic sets, and . If subtrees and are replaced with trees and , respectively, then by Lemma 7 the functional does not decrease (indeed, if, e.g., is replaced by , the cost of the events from group (i) does not decrease, and the total cost of groups (ii) and (iii) remains constant). Consequently, such a replacement does not affect the global minimum, and trees over and in the desired solution can be legitimately considered those and that are already constructed at previous steps of the algorithm. The algorithm will output as the global minimum of the functional .

2.9.1. Remark

According to Lemma 5, the cost includes the total cost of duplications in all paralogous subtrees over all over all species from . Therefore, the costs of singletons can be any constants, as the optimal tree does not depend on them. The set can also be simplified by replacing all paralogous subtrees with singleton subtrees.

Phase 1 of the algorithm produces a set of basic trees, where runs over all basic sets. If the set of all species is not basic, it will not contain a tree over . In this case, Phase 1 returns no conditional supertree; that is, the conditional problem has no solution.

A natural question is “how to determine if the degree of consistency of the input set of trees suffices for the correct supertree to exist?” An empiric directive for the moment can be that the trees are consistent enough if is a basic set.

2.10. The Second Problem: Phase 2 of the Supertree Building Algorithm

The set is not unambiguously defined by the initial set of gene trees . For this reason, a heuristics is implemented in Phase 2 of the algorithm to solve the unconditional problem and assemble basic trees into one species tree over under a certain fixed . This heuristic solution largely depends on the outcome of Phase 1. The assembling can be done using a variety of known methods. We propose an original ad hoc “augmentation” method described below.

Consider a tree over a set . Its cost is defined as the total cost of mappings of all basic trees (with two or more leaves) pruned to contain only species from the set .

Let contain only three species. The basic cost is the minimal cost among all trees over . The subbasic cost is the minimal cost strictly greater than . The reliability is defined as . By enumerating all such , find a tree over with a nonzero reliability and the minimal value of . If for any the cost does not exist, the algorithm terminates. The final tree is the result of the basis of the induction.

An inductive step is similar. Let a tree with species be obtained. Consider all pairs: species from not contained in and edge from including its root edge. The edge is broken in two by inserting a new vertex connected with a newly added leaf , thus generating a new tree . The basic cost is the minimal cost when is fixed and is a variable. The subbasic cost is the minimal cost strictly greater than . The reliability is defined as above. By enumerating all find a tree with a nonzero reliability and for which is minimal. If does not exist for a species , the species is marked as unreliable and not used in Phase 2. An augmentation step is a transition from to ; the steps are continued until the current contains all successfully attempted species from . The resulting species tree is the output of Phase 2 of the algorithm.

The correctness of Phase 2 is proved by Theorem 9. Informally, the topologies of trees in Theorem 9 are assumed to share at least some topological similarity.

Theorem 9. Let the cost of an implicit loss be zero. If there exists a tree over such that each basic tree can be obtained by pruning to contain only species from , then the augmentation leads to a species tree with the zero cost, and the conditional problem is solved. The converse statement is also true.

Proof. In the first statement additionally, intermediate trees also have zero costs.If a tree over is obtained by pruning the tree , then all basic trees pruned to are also prunings of , and, by Lemma 4, the tree has the zero cost. Thus, the augmentation, in where all trees are prunings of , is the desired process. Obviously, such the process exists. The converse statement follows from Lemma 4.

2.11. Modification of Phase 1

If topologies of the initial trees strongly contradict (an example is provided in [64]), then Phase 2 produces a tree with a nonzero cost; that is, according to Theorem 9, there exists a basic tree that cannot be obtained by pruning the output of Phase 2 to contain only species from the set . This situation occurs because the basic trees are optimal in terms of the functional , not in terms of the more accurate total mapping cost.

Computer simulations suggest (data not shown) that Phase 2 performs more accurately in the below case. Let be a fixed subset of and an element of . Prune each initial gene tree to (denote the result ) and each element from to (denote the result ; . For a fixed , apply Phase 1 to the sets and . Let be a basic tree over , if such exists. Apply Phase 2 to the set and denote the result . An analog of Theorem 9 is easily proved for with Lemma 10 stated below. If a set is basic for and , the basic trees over may be different.

Lemma 10. If a set is basic for , then it is basic for .

Proof. Since belongs to , it also belongs to . Use induction on the increase of . Singleton sets are always basic. If is a nonpruned basic set, it can be partitioned into two nonpruned basic subsets. By inductive assumption, the subsets are pruned basic. Then is also pruned basic.

The running time of modified Phase 1 is obviously times greater compared to standard Phase 1. For both versions of Phase 1, the complexity of Phase 2 has the order of , which is proved in [25].

2.12. Definitions of Binarization and Paralogous Binarization

Hereafter, only a canonic mapping is considered and applied to polytomous trees (in the definition of “for both children” is naturally replaced with the “for all children”, refer to Section 2.6). Fix a polytomous gene tree . Describe the procedure that starts from the initial and iteratively derives . Let in this procedure a tree be already derived and possess a polytomous vertex . Then arbitrarily divide the children of with their incoming edges into two nonempty parts and , and for each part (with the corresponding subtrees) introduce an intercalating edge connecting a new vertex (the ancestor of this part) with ; if a part is a singleton, the corresponding new vertex is eliminated (none of the trees contains edges with one child). The tree so acquires two or one new vertices and keeps the ones inherited from , and the vertex becomes binary. The described operation is the step of binarization of vertex against partition . Repeat the operation until all polytomous vertices are found. Name the obtained “resolved” tree a candidate binarization of .

Fix a binary species tree and the polytomous gene tree . Among all candidate binarizations of , find such that has the minimal embedding cost among the values ( is a variable); name a binarization of against .

By definition, for given and , an edge from enters (downwards) a tube in if
and henceforth designated .

For a vertex from or designate a tube that equals (if is a tube) or the tube incoming in (if is a vertex). For each vertex from , its clades in and are equal. For from the tube depends only on clade in ; that is, is the same in and in . Note that the triple inequality above is equivalent to

A paralogous binarization of against is a candidate binarization , in which for each tube the number of entering edges is minimal among all candidate binarizations . Intuitively, it minimizes the number of paralogs.

A paralogous binarization of exists and is produced from the initial with the following iterative procedure. Let a certain be already obtained. Choose arbitrarily a polytomous vertex g in , and let produce two child tubes, and . Divide all children of vertex into three parts defined according to the conditions , , , respectively. The parts are disjoint. If only the first part is nonempty, arbitrarily divide it in two nonempty sets. If the first and at least one of the other two parts are nonempty, the first set coincides with the first part, and the second set is the union of the second and third parts. If the first part is empty, the two sets are the second and third parts, correspondingly; both are nonempty by definition of . Perform a step of binarization of vertex against partition , where is the first set and is the second set. A new is thus derived. Apply the procedure until all polytomous vertices are visited; the result, according to Lemma 11, is the paralogous binarization of against .

A bundle of edges for in is a nonempty maximal on inclusion set of edges in that have the common upper terminus (the vertex parent of the bundle), and all enter .

Denote the amount of bundles in for . The vertex parent of a bundle is denoted by . Obviously, a bundle has a unique vertex parent; and vertex parents of different bundles for are different in (and ); edges of different bundles for are incomparable in .

A complement of bundle is a set of edges , for which and does not belong to . For the paralogous binarization , an edge (where and are in ) is called a parent of bundle in for , if is the last common ancestor of the lower termini of all edges in and is not the ancestor of the lower termini of all edges in .

Lemma 11. For any candidate binarization and mapping into , at least edges enter each tube . For (against ) and for each bundle (in for d) and the mapping into , its parent in exists and enters the tube . Conversely, each edge entering tube is the parent of a bundle for . Consequently, is a paralogous binarization.

Proof. The first statement. In the mapping of into , each bundle in for induces at least one edge entering ; different bundles induce different edges. Indeed, let be any edge in the bundle. Then on a path in connecting and , there exists an edge in that enters . As for any two bundles for , their vertex parents are different; for any two corresponding paths in , the set of edges in one path does not intersect with the set of edges in another path. Consequently, at least edges enter .The second statement. Let be a bundle for tube , and . By definition of the bundle, , and for all lower termini of edges in , we observe . If the vertex is binary or is the superroot, then , and the assertion is obvious. Let vertex be polytomous. If , the assertion is obvious. Otherwise, consider in a maximally long path of vertices , where each vertex descends directly from the other and is ancestor of the lower termini of all edges in the bundle . Observe ; otherwise during partitioning, the set of children of in the constructed , all children from the bundle belong in one part (the second or the third one), which contradicts the assumption of maximal . It follows that the edge enters the tube and is the parent of the bundle .The third statement. Let , where is an edge in . By constructing the candidate binarization, there exists such vertex in that , where is a child of in . Consider the set of children of vertex , for which in . The edges in having lower terminus in this set form a nonempty subset of a certain bundle for , where . Let be the bundle parent. Then is comparable with , and enters according to Lemma 11. Any two comparable edges cannot enter the same tube; therefore .

The described paralogous binarization procedure runs in linear time.

Lemma 12. Let be a bundle for , let be a bundle for (both in ), and . If , then in the paralogous binarization , the parents of and share the common upper terminus. And, conversely, if the parents of and share a common upper terminus in , then .

Proof. Define with a tube such that , and with a vertex . If is a binary vertex in , the assertion is obvious.Otherwise, consider in a maximally long path of consecutive vertices , where each is an ancestor of a set of all lower termini of edges in the union of and . Observe ; otherwise during partitioning the set of children of in the constructed , all its children from would belong in one part (the second or third one), which contradicts the assumption of maximal .The parents of and share a common upper terminus, as in the constructed the bundles and correspond to the second and third parts of the children set (the first part is empty, as follows from the assumption of maximal ). By the procedure, the parts are induced by separate edges, the parents of the corresponding bundles, and the two edges share the common upper terminus.Prove the converse statement by contradiction. Denote the parents of bundles and as and . Consider in a path connecting with the lower terminus of an arbitrary edge from and a path connecting with the lower terminus of an arbitrary edge from . Then contains and contains . By our assumption, . Consequently, no two edges, one belonging to and the other belonging to , can share a common upper terminus. The contradiction is obtained.

The number of different is exponential of the maximal number of edges in sharing the upper terminus , for which . Importantly, any can be legitimately used in Section 2.13, according to Lemma 13. Our algorithm of constructing one for any can be easily extended to enumerate any portion of the binarization solutions space.

Lemma 13. Embedding costs of all against a fixed are equal.

Proof. Each paralogous binarization possesses the same amount of vertices. Note that in a canonic mapping each edge entering a tube corresponds either to a divergence (if or loss (otherwise). Conversely, a divergence corresponds to a pair of edges with a common upper terminus entering the tubes a with common upper terminus, and a loss corresponds to edge entering tube , where . Hence, draw two bijective correspondences:(1) between divergences and unordered pairs , where , , , and ;(2) between losses and pairs , where , which do not fall in correspondence (1) with any divergence.According to these correspondences and Lemmas 11-12, in a mapping of a paralogous binarization into , there exist as many divergences as there are unordered pairs of bundles of the form in , where , . Other nonleaf vertices are duplications; therefore their amount does not depend on . The amount of losses is also -independent; according to the correspondences above and Lemma 11, there exist as many losses as there are bundles that do not fall in the correspondence with any divergence.

Lemma 14. In a paralogous binarization , let an edge be a parent of a bindle . The number of leaves contained in the clade of in does not depend on .

Proof. By definition of the bundle parent, the set of leaves contained in below is the set of leaves contained in below the edges of bundle . This set depends only on and .

For canonic mappings of (against ) into , hold the following analogs of Lemmas 5–7.

Lemma 15. In a canonic mapping of into , each leaf tube of species contains the amount of duplications equal to the difference between the total amount of leaves below the edges of the bundles from for and the amount of all bundles for in .

Proof. Denote this amount of duplications . In a canonic mapping of into , the edges parental to the bundles for a leaf tube enter the tube , and all nonleaf vertices in the tree lower to these edges are duplications. By Lemma 14, for each such bundle , there are leaf vertices lower to the parent of . A binary tree contains internal vertices compared with the number of leaves; therefore, the number of duplications is also of the number of edges in a bundle.

Lemma 16. Fix a polytomous tree over a subset of . Let species trees and be both defined over , each containing a certain subtree . The total costs of all events in the mappings of into and into , having place in , are equal. In other words, the total cost depends only on the subtree and not on its complement.

Proof. In a canonic mapping of into or , the edges of tree , the parents of the bundles for the root tube in a subtree , enter the tube . All vertices below these edges map into . Conversely, if a vertex maps into , then on the path connecting it with the superroot, there exists an edge entering and, by Lemma 11, being a parent of a bundle for . If an edge from is parental to a bundle for , then by Lemma 14 the set of leaves below is defined by the bundle and does not depend on . The number of vertices in a binary subtree is determined by the number of leaves. Consequently, the amount of vertices mapped into does not depend on . According to correspondence (1) stated in the proof of Lemma 1 and to Lemma 11, the amount of divergences in these vertices is exactly the amount of the unordered pairs in , where , , and lies in . Other nonleaf vertices are duplications. According to correspondence (2) stated in the proof of Lemma 1 and to Lemma 11, the number of losses in is also -independent and is exactly the number of bundles for tubes , which do not fall in a correspondence with any divergence where lies in .

Lemma 17. Fix a polytomous tree over a subset of . Let be a subset of , and a species tree over contains a subtree over . Let a tree be derived from by substituting the subtree with a subtree over . The total costs of all events in the mappings of into and having place in the complements of in and in are equal. In other words, the total event cost does not depend on a subtree.

Proof. The mapping maps into each vertices in a tree that are below the edges parental to the bundles for the root tube in . By definition of the bundle, the set of such bundles depends only on the clade of that does not depend on index . According to the argument in the proof of Lemma 16, the same amount of such vertices is mapped in each , regardless of . Consequently, the complement of receives the same amount of vertices. Among these vertices, the number of divergences equals exactly the number of the unordered pairs in , where , , and does not lie in . Other nonleaf vertices are duplications. The number of losses in is also -independent and is exactly the number of bundles for tubes , which do not fall in a correspondence with any divergence where does not lie in .

2.13. The Second Problem for a Fixed Set of Polytomous Gene Trees

The general problem for a given set of polytomous gene trees is to find a species tree that minimizes the total sum (over binarizations of all ) of the mappings of into . The unconditional (absolute) problem imposes no constraint on the solution space. In the conditional problem, the search space of trees (including