Tuesday, December 20, 2016

Isogloss maps are hypergraphs are bipartite networks

Linguists are a very special people. They are very proud, especially when
biologists tell them how to do phylogenetic analyses; but their pride is often
also justified, as many phylogenetic concepts were initially or independently developed by
linguists, be it the family tree model, proposed years before Darwin's
(1859) tree by Čelakovský (1853), or
even the cladistic principle of synapomorphies, which are called "exclusively
shared innovations" in linguistics (see Brugmann 1884).

Linguists also invented one interesting kind of data-display which so far has
never been used by biologists (at least as far as I know): maps of isogloss
boundaries. The term "isogloss" is unfortunate, as it has multiple usages
in linguistics, and its history seems to go back to a naive borrowing from
chemistry (but I have not really followed the literature here). On most
occasions, it just means "shared trait". That is, it denotes a feature shared
between two or more languages; and given that languages may share many
different features, isoglosses for a group of related languages may yield a
very complex type of data. Isoglosses are somehow related to the wave
theory, the arch-enemy of the family tree in linguistics, which I described as a
mystical theory some time ago, since it never really made it to a clear-cut
model that could be formalized (The Wave Theory: the predecessor of network thinking in historical linguistics).

Some linguists, nevertheless, insist that the waves that are the core of the
wave theory are nothing other than isoglosses. More specifically, the waves
represent innovations that contribute to the separation of languages (a change
in pronunciation of a word here, a change in grammar there), but which are not
transmitted vertically — they spread across the speakers of a language and may
even cross linguistic borders. One early visualization of these waves can be found in Bloomfield (1933), as shown here:

What Bloomfield essentially does here is pick certain traits of
Indo-European languages, calling them isoglosses, and arrange them on a
quasi-geographic map of Indo-European languages in such a way that all
languages sharing a trait are inside one of these isogloss boundaries.

Only recently did I realise what this actually means, when I found the "Bible of
Network Theory" by Newman (2010) and started reading at a
random page, which, as it turned out, treated hypergraphs. Hypergraphs,
as I learned from Newman, are graphs in which one edge can connect more than
two nodes, and Newman used exactly the same visualization for these hyperedges as
Bloomfield had done in 1933, without Bloomfield knowing that it was actually a rather
complex network structure he was proposing.

Even more interesting than the complex graph structure is that hypergraphs
can likewise be displayed, without losing any information, as
bipartite networks, in which we distinguish two fundamental kinds of nodes,
and in which connections are only allowed between nodes of different kinds.
In order to do so, one just converts each hyperedge into a node that connects
to all the nodes (languages in our case) that the hyperedge joins in the
hypergraph. In the same way that Bloomfield
labeled the hyperedges in his legend, we can label the isogloss nodes that
connect to the languages. The following image shows the resulting bipartite network for Bloomfield's hypergraph:
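
The conversion itself can be sketched in a few lines of Python. This is only a minimal illustration, not Bloomfield's full map: it encodes just the two isoglosses whose members are spelled out further down in this post (the sibilants and the "past e-" feature), and the dictionary layout and the function names `to_bipartite` and `to_hypergraph` are assumptions made for the example.

```python
# Toy hypergraph: each isogloss (hyperedge) maps to the set of languages it encloses.
# Only the two isoglosses described in this post are included; the real map has more.
hypergraph = {
    "sibilants for velars": {"Balto-Slavic", "Indo-Iranian", "Armenian", "Albanian"},
    "past e-": {"Ancient Greek", "Indo-Iranian"},
}


def to_bipartite(hypergraph):
    """Turn every hyperedge into a node linked to each language it contains."""
    isogloss_nodes = set(hypergraph)
    language_nodes = set().union(*hypergraph.values())
    # Edges only ever run between an isogloss node and a language node.
    edges = {(iso, lang) for iso, langs in hypergraph.items() for lang in langs}
    return isogloss_nodes, language_nodes, edges


def to_hypergraph(edges):
    """Invert the conversion by regrouping languages under their isogloss."""
    recovered = {}
    for iso, lang in edges:
        recovered.setdefault(iso, set()).add(lang)
    return recovered


isogloss_nodes, language_nodes, edges = to_bipartite(hypergraph)
for iso, lang in sorted(edges):
    print(f"{iso} -- {lang}")

# The round trip returns the original hypergraph, so nothing was lost.
assert to_hypergraph(edges) == hypergraph
```

The round trip at the end is the point of the exercise: regrouping the bipartite edges by isogloss node gives back exactly the hyperedges we started with.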

If you now ask what this tells us after all, I will disappoint you — so far it
does not tell us anything, it is just a display of data in a different fashion.
Note, however, that hypergraph visualization is not a trivial problem, and if
you have enclaves not sharing a trait, it may even be impossible to draw a
hyperedge in a two-dimensional space as one single line that encloses
all of its nodes. Bipartite networks are easier to handle in this regard. Even more
importantly, however, bipartite graphs are also easy to handle algorithmically, and
biologists are currently developing new methods to handle them (Corel et al.
2016).

If we visualize the Bloomfield data in a bipartite network using network
visualization software such as Cytoscape, we can
conveniently explore the data, and arrange the nodes in order to search for
patterns in the isoglosses. The following visualization, for example, shows
that Bloomfield chose the data well in order to illustrate the amount
of conflicting, apparently non-tree-like, signal in Indo-European languages
(remember that linguists tend to dislike trees, but not necessarily in a
productive way), as the data describe more of a circular structure than a strict hierarchy.
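
One simple way to make such conflict explicit is a pairwise compatibility check. The sketch below assumes, purely for the sake of argument, that every isogloss is meant to mark a clade in a rooted tree; under that assumption two isoglosses can sit on the same tree only if their language sets are disjoint or nested. The data are again just the two isoglosses described in this post, and the function names are hypothetical.

```python
from itertools import combinations

# Same toy data as described in the text: two isoglosses and their languages.
isoglosses = {
    "sibilants for velars": {"Balto-Slavic", "Indo-Iranian", "Armenian", "Albanian"},
    "past e-": {"Ancient Greek", "Indo-Iranian"},
}


def compatible(a, b):
    """Two language sets fit on one rooted tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def conflicts(isoglosses):
    """Return all pairs of isoglosses that overlap without being nested."""
    return [
        (x, y)
        for x, y in combinations(sorted(isoglosses), 2)
        if not compatible(isoglosses[x], isoglosses[y])
    ]


for x, y in conflicts(isoglosses):
    print(f"{x} conflicts with {y}")
```

Here the two isoglosses overlap in Indo-Iranian but neither contains the other, so the pair is reported as conflicting. Note that such a check only describes the data; it says nothing about whether the conflict stems from independent development, loss, or contact.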

In order to really interpret this kind of data, however, we should not forget
that this is still a data-display network. It is by no means a phylogenetic
analysis, as we only show how a certain amount of data, selected by a scholar,
is distributed over the given language groups. A true phylogenetic analysis will
need to interpret these data, making bold claims about the history of those
shared traits.

The existence of sibilants (s-like sounds, like [s, z, ʃ, ʒ])
for certain velar sounds (k-like sounds, like [k, g, x]), for example, is a
trait shared by Balto-Slavic, Indo-Iranian, Armenian, and Albanian, but this
does not mean that they all inherited it from a common ancestor, as the process
of palatalization, by which velar sounds turn into affricates and fricatives
(compare French cent, which was pronounced kentum in Latin), is very
frequent in the languages of the world, and may well reflect independent
evolution.

Apart from independent development, which would actually force us
to revise our network by deleting the respective edges, because they are not
homologous in the strict sense, we may also have to deal with differential
loss. This quite likely happened with the shared feature labeled as "past e-"
in the network, referring to the past tense in Ancient Greek and Indo-Iranian,
which was augmented by the prefix e-.

A further reason for those
commonalities labelled as isoglosses by linguists may also be simple lateral
transfer due to language contact.

Proponents of the wave theory have taken this kind of data as proof that the
family tree model is essentially wrong. While I would agree that the family
tree model shows only a certain aspect of language evolution, and may therefore
be boring at times (and even wrong, if we do not manage to correctly
interpret the nature of shared traits), I have a hard time understanding why
linguists still insist that isogloss maps are an alternative model of language
evolution. They are surely not, in the same way in which splits graphs are not
phylogenetic networks, as David emphasized in a recent
blogpost.

Unless we add the missing time dimension and analyse how the shared traits
originated, isogloss maps and hypergraphs will remain nothing more than an interesting form of data visualization.
Given the recent research on bipartite networks, however, we may have some hope that
the mysterious waves in historical linguistics may not only
find a formal model of representation, but even bring us to the point
where we gain new insights into the history of our languages.

1 comment:

Just a guess, but the term 'isogloss' probably comes from a semantic extension of the XIXth-century usage of 'isophone' (which denoted not precisely "phonetic isoglosses" as today, but homophones between languages) and 'isoseme', terms used in the science of "etymology". Here is an example from an Italian manual: https://books.google.com.br/books?id=1Ig_AAAAIAAJ&pg=PA244&dq=isofona&hl=pt-BR&sa=X&redir_esc=y#v=onepage&q=isofona&f=false