Monday, September 24, 2018

Structural data in historical linguistics

The majority of historical linguists compare words to reconstruct the
history of different languages. Apart from the cognate sets used in
phylogenetic studies, which reflect shared homologs across the languages
under investigation, there is, however, another data type that scholars
have been trying to explore in the past.
This data type is difficult for non-linguists to understand, given its
very abstract nature. In the past, it has led to a
considerable amount of confusion, both among linguists and among
non-linguists
who tried to use this data for quick (and often also dirty) phylogenetic
approaches. For this reason, I figured it would be useful to introduce
this type of data in more detail.

This data type can be called "structural". To enable interested readers to experiment
with
the data themselves, this blogpost comes along with two example
datasets that we converted into a
computer-readable format (with much help from David), since the original
papers only offered the data as PDF files. In future blogposts, we will try to illustrate how the data can,
and should, be explored with network methods. In this first blogpost, I will
try to explain the basic structure of the data.

Structural data in historical linguistics and language typology

In order to illustrate the type of data we are dealing with here,
let's have a look at a typical dataset, compiled by the famous linguist
Jerry Norman to illustrate differences between Chinese dialects (Norman 2003). The table below shows a part of the data provided by Norman.

No. | Feature                                          | Beijing | Suzhou | Meixian | Guangzhou
----|--------------------------------------------------|---------|--------|---------|----------
 1  | The third person pronoun is tā, or cognate to it |    +    |   -    |    -    |    -
 4  | Velars palatalize before high-front vowels       |    +    |   +    |    -    |    -
 7  | The qu-tone lacks a register distinction         |    +    |   -    |    +    |    -
12  | The word for "stand" is zhàn or cognate to it    |    +    |   -    |    -    |    -

In this example, the data is based on a questionnaire that
provides specific questions, and for each of the languages in the sample, the
dataset answers a given question with either + or -. Many of these datasets are
binary in nature, but this is not a necessary condition: questionnaires
can also query categorical variables. The major type of word order, for
example, might have three categories (subject-object-verb, subject-verb-object,
or other).
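To make the abstract nature of this data a bit more concrete: such a table is essentially a feature matrix, and simple distance measures can be defined on it. The following Python snippet is a minimal sketch of my own (the variety names and feature numbers follow the table above; this is not the format used by the actual datasets):

```python
# A minimal sketch (my own illustration, not the datasets' own format):
# Norman's binary features stored as a per-variety mapping from
# feature number to its value.
features = {
    "Beijing":   {1: "+", 4: "+", 7: "+", 12: "+"},
    "Suzhou":    {1: "-", 4: "+", 7: "-", 12: "-"},
    "Meixian":   {1: "-", 4: "-", 7: "+", 12: "-"},
    "Guangzhou": {1: "-", 4: "-", 7: "-", 12: "-"},
}

def hamming(a, b):
    """Count the features on which two varieties disagree."""
    shared = set(features[a]) & set(features[b])
    return sum(features[a][f] != features[b][f] for f in shared)

print(hamming("Beijing", "Suzhou"))  # 3: they disagree on features 1, 7, and 12
```

Distance matrices of this kind are exactly what distance-based network methods such as Neighbor-Net take as input, although real analyses are, of course, based on far more than four features.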

We can also see that the questions can be very diverse. While we often
use more or less standardized concept lists for lexical research (such as
fixed lists of basic concepts, List et al. 2016), this kind
of dataset is much less standardized, due to the nature of the questionnaire:
asking for the translation of a concept is more or less straightforward, and
the number of possible concepts that are useful for historical research is
quite constrained. Asking a question about the structure of a language,
however, be it phonological, lexical, based on attested sound changes, or on
syntax, provides an incredible number of different possibilities. As a result,
it seems that it is close to impossible to standardize these questions across
different datasets.

Although scholars often call the data based on these questionnaires
"grammatical" (since many questions are directed towards grammatical features,
such as word order, presence or absence of articles, etc.), most datasets show
a structure in which questions of phonology, lexicon, and grammar are mixed.
For this reason, it is
misleading to talk of "grammatical datasets"; the term "structural
data" seems more adequate, since this is what the datasets were originally
designed for: to investigate differences in the structure of different
languages, as reflected in the most famous World Atlas of Language Structures
(Dryer and Haspelmath 2013, https://wals.info).

Too much freedom is a restriction

In addition to mixed features that can be observed without knowing the history
of the languages under investigation, many datasets (including the one by
Norman we saw above) also use explicit "historical" (diachronic in
linguistic terminology) questions in their questionnaires. In his paper
describing the dataset, Norman defends this practice, as he argues that the
goal of his study is to establish an historical classification of the Chinese
dialects. With this goal in mind, it seems defensible to make use of historical
knowledge, and to include observed phenomena of language change in general, and
sound change in particular, when compiling a structural dataset for a group of
related language varieties.

The extremely diverse nature of questionnaire items in
structural datasets, however, makes their interpretation very
difficult.
This becomes especially evident when using the data in combination
with
computational methods for phylogenetic reconstruction. This is
problematic for two major reasons.

First, since questions are by nature less restricted regarding their content,
scholars can easily pick and choose the features in such a way that they
confirm the theory they want them to confirm rather than testing it
objectively. Since scholars can select suitable features from a virtually
unlimited array of possibilities, it is extremely difficult to guarantee
the objectivity of a given feature collection.

Second, if features are mixed, phylogenetic methods that work on explicit
statistical models (like gain and loss of character states, etc.) may
often be inadequate to model the evolution of the characters, especially
if the characters are historical. While a feature like "the language
has an article" may be interpreted as a gain-loss process (at some
point, the language has no article, then it gains the article, then it
loses it again, etc.), features showing the results of processes, like "the
words that originally started in [k] followed by a front vowel are now pronounced as [tɕ]", cannot easily be modeled in this way, since the feature itself already describes a process rather than a state.

For these reasons, all phylogenetic studies that make use of structural data, in
contrast to purely lexical datasets, should be treated with great care, not only
because they tend to yield unreliable results, but more importantly because
they are extremely difficult to compare across different language families,
given the great freedom scholars have when compiling
them. Feature collections provided in structural datasets are an interesting
resource for diversity linguistics, but they should not be used to make primary
claims about external language history or subgrouping.

Two structural datasets for Chinese dialects

Before I start to bore the already small circle of readers interested
in these topics, it seems better to stop discussing the usefulness of
structural data at this point, and to introduce the two datasets that
were promised at the beginning of the post.

Both datasets target Chinese
dialect classification, the former proposed by Norman (2003), and
the latter reflecting a new data collection that was recently used by Szeto et al. (2018) to propose a North-South split of the dialects of Mandarin Chinese with the help of a Neighbor-Net analysis (Bryant and Moulton 2004). Both datasets have been uploaded to Zenodo, and can be found in the newly established community collection cldf-datasets.
The main idea of this collection is to collect various structural
datasets that have been published in the literature in the past, and
allow those people interested in the data, be it for replication studies
or to test alternative approaches, easy access to the data in
various formats.

The basic format is based on the format specifications laid out by the CLDF initiative (Forkel et al. 2018), which provides a software API, format specifications, and examples for
best practice for both structural and lexical datasets in historical
linguistics and language typology. The collection is curated on GitHub (cldf-datasets), and datasets are converted to CLDF (with all languages being linked to the Glottolog database, glottolog.org, Hammarström et al. 2018)
and also to Nexus format. The datasets are versioned and may be updated in
the future, and interested readers can study the code used to
generate the specific data format from the raw files, as well as the
Nexus files, to learn how to submit their own datasets to our initiative.
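To give readers an impression of what a Nexus export of such data looks like, here is a hypothetical miniature example of my own, encoding just the four Norman features from the table above as a binary matrix (+ coded as 1, - as 0); the real files in the collection are, of course, much larger:

```
#NEXUS
BEGIN DATA;
    DIMENSIONS NTAX=4 NCHAR=4;
    FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;
    MATRIX
        Beijing    1111
        Suzhou     0100
        Meixian    0010
        Guangzhou  0000
    ;
END;
```

Files of this kind can be loaded directly by standard phylogenetic software, which is precisely why the Nexus export is provided alongside the CLDF data.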

Final remarks on publishing structural datasets online

Given that we provide only two initial datasets for an enterprise whose general
usefulness is highly questionable, readers might ask themselves why we
are going through the pain of making data created by other people
accessible through the web.

The truth is that the situation in
historical linguistics and language typology has for a very long time been
very unsatisfactory. Most of the research based on data did not supply the
data with the paper, and often authors directly refuse to share the
data when asked after publication (see also the post on Sharing supplementary data). In other cases, access to the
data is hampered because it is provided only as PDF tables
inside the paper (or, even worse, as long tables in the supplement of a
paper), which forces scholars wishing to check a given analysis themselves
to reverse-engineer the data from the PDF. That data is provided in a form that is difficult to access is not even necessarily the fault of the authors, since some journals restrict the form of supplementary data to PDF only, giving authors who wish to share their data in an appropriate form a difficult time.

Many colleagues think
that it is time to change this, and we can only change it by
offering standard ways to share our data. The CLDF and
Nexus formats, in which the two Chinese datasets are now published in this
open repository collection, may hopefully serve as a starting point for
larger collaboration among typologists and historical linguists.
Ideally, all people who publish papers that make use of structural
datasets would — similar to the practice in biology, where scholars
submit data to GenBank (Benson et al. 2013) — submit their data in CLDF and Nexus formats, so that their colleagues
can easily build on their results and test them for potential errors.

List J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp 2393-2400.

2 comments:

Since questions are by nature less restricted regarding their content, scholars can easily pick and choose the features in such a way that they confirm the theory they want them to confirm rather than testing it objectively.

I'm a phylogeneticist in biology. We cannot avoid accidental sampling bias – except by simply making our datasets large enough. As long as we don't end up with redundant characters (i.e. the same character twice in different wordings), this works; there are simulation studies to show that.

Total-evidence approach!

If features are mixed, phylogenetic methods that work on explicit statistical models (like gain and loss of character states, etc.) may often be inadequate to model the evolution of the characters, especially if the characters are historical.

Then use parsimony. The behavior of parametric methods (maximum likelihood, Bayesian inference) when given datasets with realistic distributions of missing data is not well understood anyway; it hasn't really been tested.

Most of the research based on data did not supply the data with the paper, and often authors directly refuse to share the data when asked after publication (see also the post on Sharing supplementary data).

...Huh. The journals I've published in require authors to publish their data.

Thanks a lot for your comments. As your surprise in the last part shows, the situation in linguistics is at times a bit different from that in biology, although I think that we are making progress. Our datasets, however, are still very small (a dataset with 200 characters per taxon is already close to being considered a large one), while, on the other hand, parsimony enjoys a bad reputation in our field, so that barely any method published in the last 10 years has been tested against the Bayesian methods that are generally considered to be more robust. But we'll see what the future brings, especially if, as I hope, more data of different kinds is shared publicly in easily accessible formats, so that people can play with the data and test its usefulness for phylogenetic reconstruction.