Tuesday, September 26, 2017

This is the 500th post from this blog, making it one of the longest-running blogs in phylogenetics, if not the longest. For example, among the other phylogenetics blogs that I have previously listed, there has been only one post so far this year that has not been about a specific computer program.

Our first blog post was on Saturday 25 February 2012; and most weeks since then have had one or two posts. We have covered a lot of ground during that time, focusing on the use of network graphs for phylogenetic data, broadly defined (ie. including biology, linguistics, and stemmatology). However, we have not been averse to applying what are known as "phylogenetic networks" to other data, as well; and to discussing phylogenetic trees, when appropriate.

For this 500th post, I thought that I should focus on what seems to me to be one of the least appreciated aspects of biology — the need to look at data before formally analyzing it.

Phylogeneticists, for example, have a tendency to rush into some specified form of phylogenetic analysis, without first considering whether that analysis is actually suitable for the data at hand. It is therefore wise to investigate the nature of the data first, before formal analysis, using what is known as exploratory data analysis (EDA).

EDA involves getting a picture of the data, literally. That picture should be clear, as well as informative. That is, it should highlight some particular characteristics of the data, whatever they may be. Different EDA tools are likely to reveal different characteristics — there is no single tool that does it all. That is why it is called "exploration", because you need to have a look around the data using different tools.

This is where splits graphs come into play, perhaps the most important tool developed for phylogenetics over the past 50 years.

Splits graphs

Splits graphs are the best current tools for visualizing phylogenetic data. They were developed back in 1992, by Hans-Jürgen Bandelt & Andreas Dress. These graphs had a checkered career for the first 15 years, or so, but they have become increasingly popular over the past 10 years.

It is important to note that splits graphs are not intended to represent phylogenetic histories, in the sense of showing the historical connections between ancestors and descendants. This does not mean that there is no reason why they should not do so, but it is not their intended purpose. Their purpose is to display phenetic data patterns efficiently. In this sense, calling them "phylogenetic networks" may be somewhat misleading — they are data-display networks, not evolutionary networks.

A split is simply a partitioning of a group of objects into two mutually exclusive subgroups (a bipartition). In biology, these objects can be individuals, populations, species, or even higher taxonomic groups (OTUs); and in the social sciences, they might be languages or language groups, or they could be written texts, or verbal tales, or tools or any other human artifacts. Any collection of objects will contain a set of such splits, either explicitly (eg. based on character data) or implicitly (eg. based on inter-object distances). A splits graph simultaneously displays some subset of the splits.
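As a small illustration of this idea, here is a sketch (with invented taxa and characters — none of this comes from a real data set) of how each binary character in a data matrix explicitly induces one split, and how different characters can induce the same split:

```python
# A minimal sketch of how binary characters induce splits (bipartitions).
# The taxon names and character columns below are invented for illustration.

def split_from_character(taxa, states):
    """Partition the taxa into two groups according to a binary character."""
    group0 = frozenset(t for t, s in zip(taxa, states) if s == 0)
    group1 = frozenset(t for t, s in zip(taxa, states) if s == 1)
    return frozenset({group0, group1})

taxa = ["A", "B", "C", "D"]
# Each character (matrix column) yields one split; identical bipartitions
# arising from different characters collapse to the same split.
characters = [
    [0, 0, 1, 1],   # the split {A,B} | {C,D}
    [1, 1, 0, 0],   # the same split, written the other way around
    [0, 1, 1, 1],   # a trivial split, {A} | {B,C,D}
]

splits = {split_from_character(taxa, col) for col in characters}
print(len(splits))  # 2 — the first two characters define the same split
```

A splits graph would then display some subset of such splits as sets of parallel edges.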

Ideally, a splits graph would display all of the splits; but for realistic biological data this is not likely to happen — the graph would simply be too complex for interpretation. So, a series of graphing algorithms have been developed that will display different subsets of the splits. That is, splits graphs actually form a family of closely related graphs. Technically, the Median Network is the only graph type that tries to display all of the splits; however, the result will usually be too complicated to be useful for EDA.

So, these days there is a range of splits-graph methods available for character-based data (such as Median Networks and Parsimony Splits), distance-based data (such as NeighborNet and Split Decomposition), and tree-based data (such as Consensus Networks and SuperNetworks). In population genetics, haplotype networks can be produced by methods that conceptually modify Median Networks (such as Reduced Median Networks and Median-Joining Networks).

The purpose of this post, however, is not to discuss all of the types of splits graphs, but to consider what computer tools we would need in order to successfully use this family of graphs for EDA in phylogenetics.

Desiderata

The basic idea of EDA is to have a picture of the data. So, any computer program for EDA in phylogenetics needs to be able to quickly and easily produce the splits graph, and then allow us to explore and manipulate it interactively.

To do this, the features listed below are the ones that I consider to be most helpful for EDA (thanks to Guido Grimm and Scot Kelchner for making some of the suggestions). It would be great to have a computer program that implements all of these features, but no such program yet exists. SplitsTree has some of them, making it the current program of choice; however, there is quite some way to go before a truly suitable program exists.

Note that these desiderata fall into several groups:

evaluating the network itself

comparing the network to other possible representations of the data

manipulating the presentation of the network

It is desirable to be able to interactively:

specify which supported splits are shown in the graph — eg. show only those explicitly supported by character data

list the split-support values

highlight particular splits in the graph — eg. by clicking on one of the edges

identify splits for specified taxon partitions (if the split is supported) — this is the complement to the previous one, in which we specify the split from a list of objects, not from the graph itself

identify which splits are sensitive to the model used — eg. different network algorithms

identify which edges are missing when comparing a planar graph with an n-dimensional one — this would potentially be complex if one compares, say, a NeighborNet to a Median Network

map support values onto the graph (ie. other than split support, which is usually the edge length) — eg. bootstrap values

evaluate the tree-likeness of the network — ie. the extent of reticulation needed to display the data

map edges from other networks or trees onto the graph — this allows us to compare graphs, or to superimpose a specified tree onto the network

find out if the network is tree-based, by breaking it down into a defined number of trees — along with a measure of how comprehensively these trees capture the network

create a tree-based network by having the network be the superset of some specified tree — eg. the NeighborNet graph could be a superset of the Neighbor-Joining tree

remove trivial splits — eg. those with edges shorter than some specified minimum, assuming that edge length represents split support

plot characters onto the graph — possibly next to the object labels, but preferably on the edges if they are associated with particular partitions

examine which subsets of the data are responsible for the reticulations — eg. for character-based inputs this might be a sliding window that updates the network for each region of an alignment, or for tree-based inputs it might be a tree inclusion-exclusion list.
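To make one of these desiderata concrete, the "remove trivial splits" item could be sketched as a simple filter over split-support values. The splits and support values below are invented for illustration; each frozenset names one side of a bipartition:

```python
# A sketch of the "remove trivial splits" desideratum: drop splits whose
# support (here standing in for edge length) falls below a user-specified
# minimum. The splits and support values are invented for illustration.

def filter_splits(split_support, min_support):
    """Keep only the splits whose support reaches the given minimum."""
    return {s: w for s, w in split_support.items() if w >= min_support}

split_support = {
    frozenset({"A", "B"}): 0.90,  # a well-supported split
    frozenset({"A", "C"}): 0.02,  # a short edge — "trivial" in the above sense
    frozenset({"B", "D"}): 0.40,
}

kept = filter_splits(split_support, min_support=0.05)
print(len(kept))  # 2 splits survive the threshold
```

In an interactive program, re-drawing the graph from the filtered split set would let the user see how much of the reticulation is carried by weakly supported edges.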

Other relevant posts

Here are some other blog posts that discuss the use of splits graphs for exploring genealogical data.

Tuesday, September 19, 2017

Arguments from authority play an important role in our daily lives and our
societies. In political discussions, we often point to the opinion of trusted
authorities if we do not know enough about the matter at hand. In medicine,
favorable opinions by respected authorities function as one of four levels of evidence (admittedly, the
lowest) for judging the efficacy of a medication. In advertising, the (at times
doubtful) authority of celebrities is used to convince us that a certain
product will change our lives.

Arguments from authority are useful, since they
allow us to have an opinion without fully understanding it. Given the
ever-increasing complexity of the world in which we live, we could not do
without them. We need to build on the opinions and conclusions of others in
order to construct our personal little realm of convictions and insights. This
is especially important for scientific research, since science is based on
a huge network of trust in the correctness of previous studies, which no single
researcher could check in a lifetime.

Arguments from authority are, however, also dangerous if we blindly trust
them without critical evaluation. To err is human, and there is no guarantee
that the analyses of our favorite authorities are always free of errors. For example, famous
linguists, such as Ferdinand de Saussure
(1857-1913) or Antoine Meillet (1866-1936),
revolutionized the field of historical linguistics, and their theories had a
huge impact on the way we compare languages today. Nevertheless, this does not
mean that they were right in all their theories and analyses, and we should
never trust any theory or methodological principle only because it was
proposed by Meillet or Saussure.

Since people tend to avoid asking why their authority came to a certain
conclusion, arguments from authority can easily be abused. In the extreme, this
may culminate in totalitarian societies, or societies ruled by religious
fanaticism. To a smaller degree, we can also find this totalitarian attitude
in science, where researchers may end up blindly trusting
the theory of a certain authority without further critically investigating it.

The comparative method

The authority in this context does not necessarily need to be a real person; it can also be a theory or a certain methodology.
The financial crisis of 2008
can be taken as an example of a methodology, namely classical
"economic forecasting", that turned out to be trusted much more than it
deserved.
In historical linguistics, we have a similar quasi-religious attitude
towards our traditional comparative method (see Weiss 2014
for an overview), which we use in order to compare languages.
This "method" is in fact no method at all, but rather a large collection of
techniques by which linguists have been comparing and reconstructing
languages over the past 200 years.
These include the detection of cognate or "homologous" words across
languages, and the inference of regular sound correspondence patterns
(which I discussed in a blog from October last year),
but also the reconstruction of sounds and words of ancestral languages
not attested in written records, and the inference of the phylogeny of a
given language family.

In all of these matters, the comparative method enjoys a quasi-religious
authority in historical linguistics. Telling historical linguists that they do
not follow the comparative method in their work is among the worst things you
can say to them. It hurts. We are conditioned from early on to
feel this pain. This is all the more surprising, given that scholars rarely
agree on the specifics of the methodology, as one can see from the
table below, where I compare the key tasks that different authors
attribute to the "method" in the literature. I think one can easily see
that there is not much overlap, nor any clear pattern.

Varying accounts on the "comparative methods" in the linguistic literature

It is difficult to tell how this attitude evolved. The foundations of the
comparative method go back to the early work of scholars in the 19th century,
who managed to demonstrate the genealogical relationship of the Indo-European languages.
Even in these early times, we can find hints regarding the "methodology"
of "comparative grammar" (see for example Atkinson 1875);
but, judging from the literature I have read, it seems that it was not until the
early 20th century that people began to present the techniques for historical
language comparison as a methodological framework.

How this framework became the framework for language comparison, although it
was never really established as such, is even less clear to me. At some point,
the linguistic world (which has always been characterized by aggressive battles
among colleagues, fought in the open in numerous publications) decided that the
numerous techniques for historical language comparison that had proven most
successful up to that point constituted a specific method, and that this
specific method was so well established that no alternative approach could
ever compete with it.

Biologists, who have experienced drastic methodological changes during recent
decades, may wonder how scientists could believe that any practice, theory, or
method is everlasting, untouchable and infallible. In fact, the comparative
method in historical linguistics is always changing, since it is a label rather
than a true framework with fixed rules. Our insights into various aspects of
language change are constantly increasing and, as a result, the way we practice
the comparative method is also improving. We keep using the same label, but the
product we sell is different from the one we sold decades ago. Historical
linguists are, however, very conservative regarding the authorities they trust,
and our field has always been very skeptical of any newly proposed
methodologies.

Morris Swadesh (1909-1967), for example, proposed a
quantitative approach to infer divergence dates of language pairs (Swadesh
1950 and later), which was refuted almost immediately after
he proposed it (Hoijer 1956, Bergsland and Vogt
1962). Swadesh's assumption of constant rates of lexical
change was surely problematic, but his general idea of looking at lexical
change from the perspective of a fixed set of meanings was very creative at
that time, and it has given rise to many interesting investigations (see, among
others, Haspelmath and Tadmor 2009). Nevertheless,
quantitative work was largely disregarded in the following decades. Not
many people paid any attention to David Sankoff's (1969)
PhD thesis, in which he tried to develop improved models of lexical change in
order to infer language phylogenies, which is probably the reason why Sankoff
later turned to biology, where his work received the appreciation it
deserved.

Shared innovations

Since the turn of the millennium, quantitative studies have enjoyed a new
popularity in historical linguistics, as can be seen from the numerous papers that have been devoted to automatically inferred phylogenies (see Gray and
Atkinson 2003 and passim). The field has begun to accept these
methods as additional tools to provide an understanding of how our languages evolved
into their current shape. But scholars tend to contrast these new techniques sharply with the
"classical approaches", namely the different modules of the comparative method.
Many scholars also still assume that the only valid technique by which phylogenies (be it trees or networks) can be inferred
is to identify shared innovations in the languages under investigation (Donohue et al. 2012, François 2014).

The idea of shared innovations was first proposed by Brugmann
(1884), and has its direct counterpart in Hennig's
(1950) framework of cladistics. In a later book of Brugmann's, we find the following passage on
shared innovations (or synapomorphies in Hennig's terminology):

The only thing that can shed light on the relation among the individual language branches
[...] are the specific correspondences between two or more of them, the innovations,
by which each time certain language branches have advanced in comparison with other
branches in their development. (Brugmann 1967[1886]:24, my translation)

Unfortunately, not many people seem to have read Brugmann's original text in
full. Brugmann says that subgrouping requires the identification of shared
innovative traits (as opposed to shared retentions), but he remains skeptical
about whether this can be done in a satisfying way, since we often do not know
whether certain traits developed independently, were borrowed at later stages,
or are simply being misidentified as being "shared". Brugmann's proposed solution to
this is to require that shared, potentially innovative traits be numerous
enough to reduce the possibility of chance.

While biology has long since abandoned the cladistic idea, turning instead to
quantitative (mostly stochastic) approaches in phylogenetic reconstruction,
linguists are surprisingly stubborn in this regard. It is beyond question that
those uniquely shared traits among languages that are unlikely to have evolved
by chance or language contact are good proxies for subgrouping. But they are
often very hard to identify, and this is probably also the reason why our
understanding of the phylogeny of the Indo-European language family has not
improved much during the past 100 years. In situations where we lack any
striking evidence, quantitative approaches may well be used to infer
potentially innovated traits. If we did a better job of logging these cases
(current software, which was designed by biologists, is not really helpful in
logging all of the decisions and inferences made by the algorithms), we could
profit a lot from turning to computer-assisted frameworks, in which experts
thoroughly evaluate the inferences made by the automatic approaches, in order
to generate new hypotheses and improve our understanding of our languages'
past.

A further problem with cladistics is that scholars often use the term shared
innovation for inferences, while the cladistic toolkit, and the reason why
Brugmann and Hennig thought that shared innovations are needed for subgrouping,
rest on the assumption that one knows the true evolutionary history (DeLaet
2005: 85). Since the true evolutionary history is a tree in
the cladistic sense, an innovation can only be identified if one knows the
tree. This means, however, that one cannot use the innovations to infer the
tree (if it has to be known in advance). What scholars thus mean when talking
about shared innovations in linguistics are potentially shared innovations,
that is, characters that are diagnostic of subgrouping.

Conclusions

Given how quickly science evolves and how non-permanent our knowledge and our
methodologies are, I would never claim that the new quantitative approaches are
the only way to deal with trees or networks in historical linguistics. The last
word in this debate has not yet been spoken; and while I view many aspects critically, I also see many opportunities for concrete improvement (List 2016).
But I see very clearly that our tendency as historical linguists to take the
comparative method as the only authoritative way to arrive at a valid
subgrouping is not leading us anywhere.

Do computational approaches really switch off the light which illuminates classical historical linguistics?

In a recent review, Stefan Georg, an expert on Altaic languages, writes that
the recent computational approaches to phylogenetic reconstruction in
historical linguistics "switch out the light which has illuminated
Indo-European linguistics for generations (by switching on some computers)", and
that they "reduce this discipline to the pre-modern guesswork stage [...] in
the belief that all that processing power can replace the available knowledge
about these languages [...] and will produce ‘results’ which are worth the
paper they are printed on" (Georg 2017: 372, footnote). It seems to me
that, if a discipline has been enlightened too much by its blind trust in
authorities, it is not the worst idea to switch off the light once in a while.

References

Anttila, R. (1972): An introduction to historical and comparative linguistics. Macmillan: New York.

Harrison, S. (2003): On the limits of the comparative method. In: Joseph, B. and R. Janda (eds.): The handbook of historical linguistics. Blackwell: Malden and Oxford and Melbourne and Berlin. 213-243.

Haspelmath, M. and U. Tadmor (2009): The Loanword Typology project and the World Loanword Database. In: Haspelmath, M. and U. Tadmor (eds.): Loanwords in the world’s languages. de Gruyter: Berlin and New York. 1-34.

Monday, September 11, 2017

Many elections now have some sort of online black box that allows you to see which political party or candidate has the highest overlap with your own personal political opinions. This is intended to help voters with their decisions. However, these black boxes usually lack any documentation of how different the viewpoints of the competing parties / candidates actually are. Exploratory data analysis via neighbour-nets may be of some use in these cases.

As a European Union citizen (of German and Swedish nationality) I am entitled to live and work in any EU country. I currently live in France, but I cannot vote for the parliament (Assemblée nationale) and government (M. Le Président) that affect my daily life, and decide on the taxes, etc, that I have to pay. However, I’m still eligible to vote in Germany (in theory; in practice it is a bit more complex).

The next election (Bundestagswahl) for the national parliament of the Federal Republic of Germany, the Bundestag (equivalent to the lower house of other bicameral legislatures), is closing in. To help the voters, a new Wahl-O-Mat (described below) has been launched by the Federal Institute of Political Education (Bundeszentrale für politische Bildung, BPB). This is a fun thing to participate in, even if you have already made up your mind about who to vote for.

Each election year, the BPB develops and sends out a questionnaire with theses (83 this year) to all of the political parties that will compete in the election. The parties can answer each thesis with ‘agree’, ‘no opinion / neutral’, or ‘don’t agree’. The 38 most controversially discussed theses have been included in the Wahl-O-Mat, and you can also answer them for yourself. As a final step, you can choose eight of the political parties competing for the Bundestag, and the online black box will show you your percentage agreement with each of them, in the form of a bar-chart diagram.

But as a phylogeneticist / data-analyst, I am naturally sceptical when it comes to mere percentages and bar charts. Furthermore, I would like to know how similar the parties’ opinions are to each other, to start with. An overview is provided, with all of the answers from the parties, but it is difficult to compare these across pages (each page of the PDF lists four parties, in the same order as on the selection page). The Wahl-O-Mat informs you that a high fit of your answers with more than one party does not necessarily indicate a closeness between the parties — you may, after all, be agreeing with them on different theses.

This means that the percentage of agreement between me and the political parties would provide a similarity measure, which I can use to compare the political parties with each other. But how discriminatory are my percentages of agreement (from the larger perspective)?

A network analysis

There are 33 parties competing for seats in the forthcoming Bundestag; one did not respond. Another one, the Party for Health Research (PfHR — a one-topic party), answered all 38 questions with 'neutral'. However, the makers of the Wahl-O-Mat still had to include it; and since that party provided no opinion on any of the questions, I scored 50% agreement with it (since I answered every question with 'yes' or 'no') — this is more than with the Liberal Party (because we actually disagree on half of the 38 questions). This is a flaw in the Wahl-O-Mat: if you say 'yes' (or 'no') to a thesis that a party has no opinion on, then it is counted as one point, while two points are awarded for a direct match. However, it does not work the other way around — answering every question with 'no opinion' yourself brings up a window telling you that your preferences cannot be properly evaluated.
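This scoring asymmetry is easy to reproduce in a few lines of code. The sketch below is my reading of the rules as just described, not the Wahl-O-Mat's actual implementation, and the answer vectors are invented:

```python
# A sketch of the agreement scoring described above (my reading of the
# rules, not the Wahl-O-Mat's actual code): a direct match earns two
# points, a party's "neutral" against a decided answer earns one point,
# and a disagreement earns none.

def agreement(user, party):
    """Percentage agreement between a user's and a party's answers."""
    points = 0
    for u, p in zip(user, party):
        if u == p:
            points += 2          # direct match
        elif p == "neutral":
            points += 1          # party has no opinion: half credit
    return 100 * points / (2 * len(user))

user = ["yes"] * 19 + ["no"] * 19    # 38 decided answers
all_neutral = ["neutral"] * 38       # an all-'neutral' answer sheet
opposed = ["no"] * 19 + ["yes"] * 19 # disagreement on every thesis

print(agreement(user, all_neutral))  # 50.0 — the flaw in question
print(agreement(user, opposed))      # 0.0 — yet this party took a stance
```

So a party that says nothing at all scores higher than one that honestly disagrees with you half the time.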

Because of this, I determined my position relative to the political parties using a neighbour-net. The primary character matrix is binary, where 0 = ‘no’, 1 = ‘yes’ and ‘?’ stands for no opinion (neutral), compared using simple (Hamming) pairwise distances. So, if two parties disagree on all of the theses, their pairwise distance will be 1; if there is no disagreement, the pairwise distance will be 0. Since the PfHR provided no opinions, I left it out (ie. its pairwise distances are undefined).
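In code, this distance calculation might look as follows. This is a sketch of the Hamming distance with pairwise deletion of '?', using invented party names and shortened answer vectors, not the actual matrix analyzed here:

```python
# A minimal sketch of the distance used above: simple (Hamming) distances
# on a binary matrix, where '?' (no opinion) is skipped pairwise.
# The party names and answer vectors are invented for illustration.

def hamming(a, b):
    """Proportion of disagreements over positions where both answered."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "?" and y != "?"]
    if not pairs:
        return None  # undefined, as for an all-'neutral' party
    return sum(x != y for x, y in pairs) / len(pairs)

answers = {
    "Party1": "110100",
    "Party2": "110000",
    "Party3": "0?1?11",
    "PfHR":   "??????",
}

print(hamming(answers["Party1"], answers["Party2"]))  # 1/6 disagreement
print(hamming(answers["Party1"], answers["PfHR"]))    # None — undefined
```

The resulting matrix of pairwise distances is what the neighbour-net algorithm then takes as input.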

Fig. 1 Neighbour-net of German political parties competing in the 2017 election (not including me). Parties of the far-left and far-right are bracketed, for political orientation. Parties with a high chance of getting into the next Bundestag (passing the 5% threshold) are in bold. [See also this analysis by The Political Compass, for comparison.]

The resulting network (Figure 1) is quite fitting: the traditional perception of parties (left-wing versus right-wing) is well captured. Parties, like the ÖDP (green and conservative), that do not fit into the classic left-right scheme are placed in an isolated position.

The graph reveals a (not very surprising) closeness between the two largest German political parties, the original Volksparteien (all-people parties): the CDU/CSU (centre-right, the party of the current Chancellor) and the SPD (centre-left). The SPD is the current (and potentially future) junior partner of the CDU/CSU, its main competitor. According to the graph, an alternative, more natural, junior partner of the CDU/CSU would be the (neo-)liberal party, the FDP.

The parties of the far-right are placed at the end of a pronounced network stem — that is, they are the ones that deviate most from the consensus shared by all of the other parties. They are (still) substantially closer to the centre-right parties than to those of the (extreme) left. However, the edge lengths show that, for example, a hypothetical CDU/CSU–AfD coalition (the AfD is the only right-wing party with a high chance of passing the 5% threshold) would join two parties with many conflicting viewpoints. Indeed, regarding their answers to the 38 questions, the CSU appears to be much closer to the AfD than to its sister party, the CDU.

Regarding the political left, the graph depicts its long-known structural problem: there are many parties, some with quite distinctive viewpoints (producing longer terminal network edges), but overall there is little difference between them. The most distinct parties in this cluster are the Green Party (Die Grünen) and the Humanist Party (Die Humanisten), a microparty promoting humanism (see also Fig. 2).

Any formal inference is bound by its analysis rules, which may represent the primary signal suboptimally. The neighbour-net is a planar graph, but profiles of political parties may require more than two dimensions to do a good job. So let's take a look at the underlying distance matrix using a ‘heat map’ (Figure 2).

Fig. 2 Heat-map based on the same distance matrix as used for inferring the neighbour-net in Fig. 1. Note the general similarity of the left-leaning parties, and their distinctness from the right-leaning parties.

We can also see that the party with the highest agreement with the SPD is still the Greens (Die Grünen). Furthermore, although the FDP and the Pirate Party have little in common, the Humanist Party (Die Humanisten) may be a good alternative if you are undecided between the other two. [Well, it would be, if each vote in Germany counted the same; but the 5% threshold invalidates all votes cast for parties that do not pass it.] The most unique party, regarding its set of answers and the resulting pairwise distances, is a right-wing microparty (see the network above) supporting direct democracy (Volksabstimmung).
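For a quick look at such a distance matrix without any graphics software, one can even render it as a crude text "heat map". The labels and distances below are invented, and this is of course not how Fig. 2 was produced:

```python
# A rough, text-only way to eyeball a distance matrix as a "heat map":
# darker characters mark larger distances. The party labels and distance
# values below are invented for illustration.

SHADES = " .:*#"  # light to dark

def text_heatmap(labels, dist):
    """Render a symmetric distance matrix (values in [0, 1]) as text."""
    lines = []
    for label, row in zip(labels, dist):
        cells = "".join(SHADES[min(int(d * len(SHADES)), len(SHADES) - 1)]
                        for d in row)
        lines.append(f"{label:>8} {cells}")
    return "\n".join(lines)

labels = ["LeftA", "LeftB", "RightA"]
dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]

print(text_heatmap(labels, dist))
```

Even at this resolution, the block structure (two similar left parties, one distant right party) is visible, which is all a heat map is meant to show.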

Applications such as the Wahl-O-Mat are put up for many elections, and, when documented in the way done by the German Federal Institute of Political Education, they provide a nice opportunity to assess, using networks, how close the competing parties (officially) are.

PS. For our German readers who are as yet undecided: the primary character matrix (NEXUS-formatted) and related files can be found here.

SPECTRE is a toolkit rather than a simple-to-use program, meaning that the various analyses exist as separate entities that can be combined in any way you like. More importantly, new analyses can be added easily, by those who want to write them, which is not the case for more commonly used programs like SplitsTree. This way, the analyses can also be incorporated into processing pipelines, rather than only being used interactively.

Apart from the usual access to data files (including Nexus, Phylip, Newick, EMBOSS and FASTA formats), the following network analyses are currently available:

NeighborNet, NetMake, QNet, SuperQ, FlatNJ, NetME

The program also outputs the networks, of course. Here is an example of the SPECTRE equivalent of a NeighborNet analysis from a recent blog post (where the network was produced by SplitsTree, and then colored by me).

Running the program(s) is relatively straightforward, once you get things installed. Installation packages are available for OSX, Windows and Linux.

Sadly, for me installation was tricky, because SPECTRE requires Java v.8, which is unfortunately not available for OSX 10.6 (which runs on most of my computers). Even getting Java v.8 installed on the one computer I have with a later version of OSX was not easy, because installing the Java Runtime Environment (the JRE download file) from Oracle does not update the Java version symlinks or add Java to the software path — for this, I had to install the full Java Development Kit (the JDK download file). Sometimes, I hate computers!