Abstract

Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks k leaves whose removal maximally reduces the tree diameter. We present a polynomial-time solution to this “k-shrink” problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three of the five datasets, and they tie in another dataset.

Notes

Acknowledgments

This work was supported by NSF grant IIS-1565862 to SM and UM. Computations were performed on the San Diego Supercomputer Center (SDSC) through allocations from XSEDE, which is supported by NSF grant ACI-1053575.

Proof

We need to prove that any \(\mathcal {R}_k(t)\) is either a reasonable removing set or not an optimal removing set. We proceed by contradiction. Assume \(\mathcal {R}_k(t)\) is optimal but not a reasonable removing set. Let \(\mathcal {R}_m(t)\) be the largest reasonable removing set that is a subset of \(\mathcal {R}_k(t)\) (note \(0\le m\le k\)). If \(m = k\), then \(\mathcal {R}_k(t)\) is a reasonable set, contradicting the assumption. For \(m < k\), consider the tree obtained by removing \(\mathcal {R}_m(t)\) from t, and let \(a_m,b_m\) be its diameter pair. If \(a_m \in \mathcal {R}_k(t)\) or \(b_m \in \mathcal {R}_k(t)\), adding that leaf to \(\mathcal {R}_m(t)\) would generate a reasonable chain of size \(m+1\), contradicting the choice of \(\mathcal {R}_m(t)\) as the largest such set. If \(a_m \notin \mathcal {R}_k(t)\) and \(b_m \notin \mathcal {R}_k(t)\), all removals after the m-th in \(\mathcal {R}_k(t)\) fail to reduce the diameter, whereas removing either \(a_m\) or \(b_m\) would reduce the diameter. Thus, \(\mathcal {R}_k(t)\) cannot be optimal, contradicting our assumption. \(\square \)
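The claim can be checked numerically on a small tree. The sketch below is only an illustration, not the paper's algorithm: it assumes that a "reasonable" removal is one that deletes a leaf belonging to some current diameter pair (the definition is not restated in this section), and the tree `TREE` and all function names are hypothetical. It compares exhaustive search over all k-subsets of leaves with a search restricted to reasonable removing sets.

```python
from itertools import combinations

def leaf_distances(adj):
    """Path length between every ordered pair of leaves of a weighted tree."""
    leaves = sorted(v for v in adj if len(adj[v]) == 1)
    dist = {}
    for src in leaves:
        stack = [(src, None, 0)]
        while stack:
            node, parent, d = stack.pop()
            if node != src and len(adj[node]) == 1:
                dist[src, node] = d
            for nxt, w in adj[node].items():
                if nxt != parent:
                    stack.append((nxt, node, d + w))
    return leaves, dist

def diameter(kept, dist):
    """Largest leaf-to-leaf distance among the leaves that are kept."""
    return max((dist[a, b] for a, b in combinations(kept, 2)), default=0)

def best_by_brute_force(leaves, dist, k):
    """Smallest diameter over all ways of removing k leaves."""
    return min(diameter([x for x in leaves if x not in r], dist)
               for r in combinations(leaves, k))

def reasonable_sets(leaves, dist, k):
    """Sets built by k removals, each taking a leaf of a current diameter pair
    (assumed reading of 'reasonable')."""
    frontier = {frozenset()}
    for _ in range(k):
        grown = set()
        for removed in frontier:
            kept = [x for x in leaves if x not in removed]
            d = diameter(kept, dist)
            for a, b in combinations(kept, 2):
                if dist[a, b] == d:
                    grown.add(removed | {a})
                    grown.add(removed | {b})
        frontier = grown
    return frontier

def best_by_reasonable_sets(leaves, dist, k):
    """Smallest diameter reachable through reasonable removing sets only."""
    return min(diameter([x for x in leaves if x not in r], dist)
               for r in reasonable_sets(leaves, dist, k))

# Hypothetical five-leaf tree with integer branch lengths (symmetric adjacency).
TREE = {
    "A": {"u": 5}, "B": {"u": 1}, "C": {"v": 2}, "D": {"v": 4}, "E": {"v": 1},
    "u": {"A": 5, "B": 1, "v": 1},
    "v": {"u": 1, "C": 2, "D": 4, "E": 1},
}
LEAVES, DIST = leaf_distances(TREE)
```

On this example the restricted search reaches the same optimum as the exhaustive one for every k tried, which is what the proposition predicts; the restricted search simply explores far fewer sets.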

Proof

To remove k leaves from a singly paired tree t that has (a, b) as a diameter pair, at least one of a or b has to be removed (or else the diameter never decreases). Thus, three types of reasonable chains exist: those that contain only a, those that contain only b, and those that contain both a and b. Note that after removing a, by Proposition 1, removing b is a reasonable removal (and vice versa), and thus, removing both a and b is always reasonable (in either order).

Case 1: \(a\in \mathcal {R}_{k-2}, b\notin \mathcal {R}_{k-2}\): If a reasonable chain has a but not b, by Proposition 1, b is in the diameter set at each step of the chain. Since b by definition is never removed and recalling that the tree is singly paired, at each step, there is only one reasonable removal (whatever taxon is on-diameter in addition to b). Therefore, only one reasonable chain does not include b.

Case 3: \(a\in \mathcal {R}_{k-2}, b\in \mathcal {R}_{k-2}\): In this case, the reasonable removing chain must start with a, b or b, a. In either ordering, we are left with the same induced tree, and need to remove \(k-2\) more leaves. Therefore, the set of all reasonable removing sets in this case is obtained by adding a and b to each reasonable removing set of size \(k-2\) of that induced tree.

Proof

Recall that the record of each internal node keeps track of the most distant leaves below the two children of the node. When we remove a, only those nodes on the path from a to the root can have a change in their record. The first traversal of Algorithm 1 updates the records for those nodes, using simple recursive functions that can be computed in O(1) per node.

According to Proposition 1, \(b \in \mathcal {D}(t\backslash _a)\). Therefore, one of the longest paths in \(t\backslash _a\) must include b; let c be the other taxon. The record of the LCA of c and b, after the update in the first round, will hold the length of this longest path. Thus, by checking the updated record for all nodes on the path from b to the root, we will find the maximum value. Moreover, when updating the records in the first traversal from a to the root, we have already checked all the nodes from LCA(a, b) to the root. In the second traversal, we check the nodes from b to LCA(a, b), completing the search. Each of the two traversals of Algorithm 1 visits at most h nodes and needs only constant-time operations in each visit. Therefore, the overall time complexity of Algorithm 1 is O(h). \(\square \)
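The bookkeeping behind this argument can be sketched as follows. This is a simplified illustration rather than Algorithm 1 itself: the data layout (`children`, `parent`, `record`, `height`) and all function names are assumptions, the example tree is hypothetical, and the diameter is read by scanning every record instead of using the O(h) second traversal. It assumes the parent of the removed leaf keeps at least one other child. What it does show is that removing a leaf only requires refreshing the records on the path from that leaf to the root.

```python
def build_records(children, root):
    """Post-order pass. record[v][c] is the distance from v to the farthest
    leaf below child c; height[v] is the largest of these (0 for a leaf)."""
    record, height = {}, {}
    order, stack = [], [root]
    while stack:                      # every node is listed before its descendants
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, {}))
    for v in reversed(order):         # so the reverse visits children first
        kids = children.get(v, {})
        if not kids:
            height[v] = 0
            continue
        record[v] = {c: w + height[c] for c, w in kids.items()}
        height[v] = max(record[v].values())
    return record, height

def diameter_from_records(record):
    """Longest leaf-to-leaf path: best sum of the two largest entries at a node."""
    best = 0
    for per_child in record.values():
        top = sorted(per_child.values(), reverse=True)[:2]
        if len(top) == 2:             # a node with one child joins no leaf pair
            best = max(best, top[0] + top[1])
    return best

def remove_leaf(children, parent, record, height, leaf):
    """Delete a leaf and refresh records only on the path from it to the root.
    Assumes the leaf's parent still has at least one child afterwards."""
    v = parent[leaf]
    del children[v][leaf]
    del record[v][leaf]
    while v is not None:
        height[v] = max(record[v].values())
        p = parent[v]
        if p is not None:
            record[p][v] = children[p][v] + height[v]
        v = p

# Hypothetical rooted tree: node -> {child: branch length}.
children = {"v": {"u": 1, "C": 2, "D": 4, "E": 1}, "u": {"A": 5, "B": 1}}
parent = {c: p for p, kids in children.items() for c in kids}
parent["v"] = None                    # "v" is the root

record, height = build_records(children, "v")
before = diameter_from_records(record)
remove_leaf(children, parent, record, height, "A")
after = diameter_from_records(record)
```

With bounded node degree, each step of the upward walk in `remove_leaf` costs constant time, which matches the per-node cost used in the proof.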

Lemma 2

Proof

If t has more than one diameter pair, let (a, b) and (c, d) be two distinct diameter pairs of t and let m be the midpoint of the path between a and b. We prove that m is also the midpoint of the path between c and d; that is, m lies on the path between c and d and \({\delta }({m},{c}) = {\delta }({m},{d})\). Without loss of generality, suppose \({\delta }({m},{c}) \ge {\delta }({m},{d})\).

We prove that the path between c and d must pass through m; that is, c and d belong to two different subsets in the partition defined by m on \(\mathcal {L}\) (we call each element of the partition a “side”). We prove this by contradiction, assuming c and d belong to the same side of m. Then \({\delta }({m},{c}) + {\delta }({m},{d}) > {\delta }({c},{d})\). Also, either a or b must be on a different side of m from c and d (by definition, a and b cannot be on the same side of m). Suppose a is on a different side of m from c and d. Then: \({\delta }({a},{b}) \ge {\delta }({a},{c}) \implies {\delta }({a},{m}) + {\delta }({m},{b}) \ge {\delta }({a},{m}) + {\delta }({m},{c}) \implies {\delta }({m},{b}) \ge {\delta }({m},{c}) \implies {\delta }({m},{a}) \ge {\delta }({m},{c})\). So we have \({\delta }({a},{c}) = {\delta }({m},{a}) + {\delta }({m},{c}) \ge 2{\delta }({m},{c}) \ge {\delta }({m},{c}) + {\delta }({m},{d}) > {\delta }({c},{d})\); this leads to a contradiction because (c, d) is a diameter pair.

Case 2: c is on the same side of m as a. Then c is on a different side of m from b. As in case 1, we can prove that \({\delta }({a},{b}) < {\delta }({b},{c})\), which also leads to a contradiction.

Thus, m is the midpoint of the path between c and d. \(\square \)
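A small numerical check of this lemma is sketched below. The tree, the function names, and the way a midpoint is encoded (either a node, or an edge with an offset from its alphabetically smaller endpoint) are all assumptions made for illustration. The code lists every diameter pair of a hypothetical tree and confirms that all of them share one midpoint.

```python
from fractions import Fraction
from itertools import combinations

def path(adj, a, b):
    """Node sequence of the unique path from a to b in a tree."""
    stack = [[a]]
    while stack:
        trail = stack.pop()
        if trail[-1] == b:
            return trail
        for nxt in adj[trail[-1]]:
            if len(trail) < 2 or nxt != trail[-2]:   # never step back
                stack.append(trail + [nxt])

def length(adj, trail):
    """Sum of branch lengths along a node sequence."""
    return sum(adj[x][y] for x, y in zip(trail, trail[1:]))

def midpoint(adj, a, b):
    """Canonical location of the point halfway along the path from a to b."""
    trail = path(adj, a, b)
    half, walked = Fraction(length(adj, trail), 2), 0
    for x, y in zip(trail, trail[1:]):
        w = adj[x][y]
        if walked + w == half:
            return ("node", y)
        if walked + w > half:
            lo, hi = sorted((x, y))
            into = half - walked                      # distance past x
            return ("edge", lo, hi, into if lo == x else w - into)
        walked += w

def diameter_pairs(adj):
    """Diameter and the list of leaf pairs that realize it."""
    leaves = sorted(v for v in adj if len(adj[v]) == 1)
    span = {(a, b): length(adj, path(adj, a, b))
            for a, b in combinations(leaves, 2)}
    d = max(span.values())
    return d, [pair for pair, s in span.items() if s == d]

# Hypothetical tree with four diameter pairs (symmetric adjacency).
TREE = {
    "A": {"u": 3}, "B": {"u": 3}, "C": {"v": 3}, "D": {"v": 3},
    "u": {"A": 3, "B": 3, "v": 2},
    "v": {"u": 2, "C": 3, "D": 3},
}
d, pairs = diameter_pairs(TREE)
midpoints = {midpoint(TREE, a, b) for a, b in pairs}
```

Here the four diameter pairs all have their midpoint in the middle of the branch between `u` and `v`, so `midpoints` contains a single element.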

This lemma allows us to define some new concepts that are useful in the rest of the proof.

New Definitions: The single midpoint of any tree t partitions the diameter set into disjoint subsets; we call each of those subsets a diameter group of t (if the midpoint is in the middle of a branch, we have two diameter groups; a midpoint coinciding with an internal node would give three or more groups). We call any restriction of t with k leaves removed a k-optimal restricted tree if no other restriction removing k leaves has a lower diameter. We call a tree t k-shrinkable if there exists a k-removing set that strictly reduces its diameter. We call any induced tree on t that has a smaller diameter than t a shrunk tree of t. Note that unless all but one of the diameter groups of a tree t are removed, the tree cannot shrink in diameter. When all but one of the diameter groups of a tree t are removed, we refer to the resulting tree as a minimum shrunk tree of t.

It is easy to see the following lemma.

Lemma 3

For all a and b, \((a,b)\in \mathcal {P}(t)\) if and only if a and b belong to two distinct diameter groups. \(\square \)
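Lemma 3 also gives a simple way to recover the diameter groups from pairwise leaf distances alone, without locating the midpoint: two leaves of the diameter set fall in the same group exactly when they have the same set of diameter partners. The sketch below relies on that consequence of the lemma; the function name and the distance table (taken from a hypothetical tree whose midpoint coincides with an internal node) are illustrative assumptions.

```python
from collections import defaultdict

def diameter_groups(dist):
    """Split the diameter set into groups using Lemma 3.

    dist maps an unordered leaf pair, written once as (a, b), to its path
    length. Leaves in one group share the same set of diameter partners,
    namely every diameter-set leaf outside their own group.
    """
    diam = max(dist.values())
    partners = defaultdict(set)
    for (a, b), d in dist.items():
        if d == diam:
            partners[a].add(b)
            partners[b].add(a)
    groups = defaultdict(set)
    for leaf, mates in partners.items():
        groups[frozenset(mates)].add(leaf)
    return diam, sorted(sorted(g) for g in groups.values())

# Hypothetical leaf distances: A and B hang off the midpoint node with
# branches of length 3; D, E, F sit below a third branch.
DIST = {
    ("A", "B"): 6, ("A", "D"): 6, ("A", "F"): 6, ("A", "E"): 5,
    ("B", "D"): 6, ("B", "F"): 6, ("B", "E"): 5,
    ("D", "F"): 4, ("D", "E"): 3, ("E", "F"): 3,
}
diam, groups = diameter_groups(DIST)
```

In this example the diameter set is {A, B, D, F} and it splits into three groups, {A}, {B}, and {D, F}; leaf E is not on any diameter pair and so belongs to no group.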

Now we prove a less obvious lemma.

Lemma 4

If tree t is k-shrinkable, any k-optimal restricted tree \(t^*\) can be induced from one of the minimum shrunk trees of t.

Proof

Because t is k-shrinkable, the diameter of \(t^*\) must be strictly smaller than the diameter of t. Suppose \(t^*\) is not an induced tree of any minimum shrunk tree of t; then, \(t^*\) has at least two leaves from two different diameter groups of t. Based on Lemma 3, \(t^*\) shares with t at least one diameter pair and therefore has the same diameter as t, which is a contradiction. \(\square \)

To produce any minimum shrunk tree \(t^i\) with \(k^i \le k^p\), we can start from any removal (a, b) such that \(a\in D^x\) and \(b\in D^y\) (for \(x\ne y\)), and continue to produce \(t^i\). To see this, note that if \(x \ne i\) and \(y \ne i\), any chain that starts with either a or b and continues to select from any groups other than \(D^i\) will produce the minimum shrunk tree \(t^i\) after \(k^i\) removals. Now, without loss of generality, consider \(x= i\) and \(y \ne i\). Then, consider the chain that starts by removing b and continues with removals from any group other than \(D^i\). This chain will also produce \(t^i\) after \(k^i\) removals. In other words, each pair-restricted k-removing space of t can produce all the minimum shrunk trees \(t^i\) that have \(k^i \le k\).

Based on Lemma 4, when t is k-shrinkable, at least one of the minimum shrunk trees (say \(t^*_i\)) can induce any k-optimal restricted tree \(t^*\). We also just proved that any pair-restricted space can produce all minimum shrunk trees. Therefore, any arbitrary pair-restricted removing space will include a chain that induces \(t^*_i\) from t and another chain that produces \(t^*\) starting from \(t^*_i\). Thus, the union of the removing sets corresponding to these two chains will produce \(t^*\) and will be part of any arbitrary pair-restricted k-removing space.

B Supplementary Figures

Alternative implementations of TreeShrink. Using \(\varDelta \) values instead of \(\log \varDelta \) values had no discernible impact on the trajectory, though it did slightly change the results. Fitting kernels on individual genes instead of fitting them on the full set of genes reduced the accuracy, but was still much better than the random control.

Impact of taxon removal on gene tree discordance. For each of the four biological datasets, we show the reduction in the distance between the species tree and gene trees measured by the MS metric (y-axis) versus the total proportion of the taxa retained in the gene trees after filtering (x-axis). Average delta MS values and standard error bars are shown over all genes for each dataset. A line is drawn between all five points corresponding to each method.