Computer-Assisted Language Comparison (CALC)


Welcome to the CALC Project

The ERC-funded research project CALC (Computer-Assisted Language Comparison, see here
for the official research proposal)
establishes a computer-assisted framework for historical linguistics. We pursue an
interdisciplinary approach that adapts methods from computer science and bioinformatics
for use in historical linguistics. While purely computational approaches are common
today, the project focuses on the communication between classical and computational
linguists, developing interfaces that allow historical linguists to produce their data
in machine-readable formats while at the same time presenting the results of
computational analyses in a transparent and human-readable way.


Introducing CALC

By comparing the languages of the world, we gain invaluable insights into human
prehistory, predating the appearance of written records by thousands of years. The
traditional methods for language comparison are based on manual data inspection. With more
and more data available, they reach their practical limits. Computer applications,
however, are not capable of replacing experts' experience and intuition. In a situation
where computers cannot replace experts and experts do not have enough time to analyse the
massive amounts of data, a new framework, neither completely computer-driven nor ignorant
of the help computers provide, is urgently needed. Such frameworks are well-established in
biology and translation, where computational tools cannot provide the accuracy needed to
arrive at convincing results, but do help humans digest large data sets.

As a litmus test of the suitability of the new framework, the project will
create an etymological database of Sino-Tibetan languages. The abundance of language
contact and the peculiarity of complex processes of language change in which sporadic
patterns of morphological change mask regular patterns of sound change make the
Sino-Tibetan language family an ideal test case for a new overarching framework that
combines the best of two worlds: the experience of experts and the consistency of
computational models.

News Archive

New Blog Post on Authority Arguments (Post from 21.09.2017 by J.-M. List)

Two days ago, I wrote another blogpost, this time on Arguments from authority, and the Cladistic Ghost, in historical linguistics. The argument I make there may look offensive, but my main intention was to draw attention to the fact that our "classical" comparative method was never classical in any sense, as it is just a label we use to denote what we do to compare languages, and that, in the light of new approaches, we should not be too dismissive, but rather try to work harder on integrated, computer-assisted frameworks, which will hopefully enable us to better understand how our languages evolved into their current shape.


DLCE, CALC, and University Jena Co-Organize Workshop at the Poznań Linguistic Meeting (Post from 15.09.2017 by J.-M. List)


The DLCE (Cormac Anderson, Paul Heggarty), CALC (Johann-Mattis List), and Friedrich Schiller University Jena (Adrian Simpson) are co-organizing a workshop as part of the Poznań Linguistic Meeting on Monday, September 18. For more information, see the workshop website, which has just been launched.


New Paper on Annotation in Historical Linguistics (Post from 14.09.2017 by J.-M. List)


I am proud to announce that a paper in which Nathan Hill and I discuss Challenges of Annotation and Analysis in Computer-Assisted Language Comparison has now been published online and can be freely downloaded from this link. The paper discusses general challenges of annotation for the purpose of historical language comparison and also introduces first ideas on how to address these challenges. Here is the abstract:

The use of computational methods in comparative linguistics is growing in popularity. The increasing deployment of such methods draws into focus those areas in which they remain inadequate as well as those areas where classical approaches to language comparison are untransparent and inconsistent. In this paper we illustrate specific challenges which both computational and classical approaches encounter when studying South-East Asian languages. With the help of data from the Burmish language family we point to the challenges resulting from missing annotation standards and insufficient methods for analysis and we illustrate how to tackle these problems within a computer-assisted framework in which computational approaches are used to pre-analyse the data while linguists attend to the detailed analyses.


Radio Interview on Language Diversity (Post from 11.09.2017 by J.-M. List)


Last week, I gave a radio interview to Deutschlandfunk Nova in which I tried my best to answer questions regarding language diversity and its driving forces. The interview, which was broadcast yesterday, can also be found online via this link.


Schedule and Abstracts for DOT Panel on Historical Linguistics Online (Post from 07.09.2017 by J.-M. List)

CALC and DLCE Organize Panel at the Deutscher Orientalistentag (Jena) (Post from 06.09.2017 by J.-M. List)


The DLCE and CALC are organizing a panel at the Deutscher Orientalistentag, which will take place in Jena this year (September 18-22). On September 21, from 9am to 1pm, scientists from the institute and external guests will share and discuss their thoughts on the topic "Languages as keys to our past".

We will soon provide more information on the list of speakers and their abstracts.


New Blog Posts for August (Post from 17.08.2017 by J.-M. List)


I wrote two new blogposts in August, one in German on the benefits of using alignments and similar visualization techniques more broadly in the media, which you can find here, and one in English, where I discuss the problem of unattested character states in phylogenetic reconstruction, specifically in linguistics, which you can find here.


Yunfan Lai's PhD thesis is now online (Post from 10.08.2017 by Y.-F. Lai)


Hi there. I defended my thesis back in June, but for a long time afterwards I could not motivate myself to upload it. Now I have finally done it.
You may now have a look at my thesis here.
Have fun!


Talk at the Human Document Project (Post from 09.08.2017 by J.-M. List)


Last week, I visited the Human Document Project 2017 in Freiburg, a project that seeks to preserve information about humans beyond the existence of the human race. As sci-fi as this may sound at first sight, it is interesting how many different questions and disciplines need to be involved in the plan of creating a time capsule that could bear witness to our existence even if we, that is, humanity, no longer exist. They invited philosophers, artists, technicians, data experts, computer scientists, physicists, and also me, as a linguist, whose job it was to give a rough overview of linguistic diversity and how we try to represent our knowledge about it. Although my talk, titled "Storing our knowledge of linguistic diversity: Towards the standardization of cross-linguistic data formats", did not involve the longer perspective of the next million years, I had the impression that it sparked the interest of the colleagues. While I remain sceptical about the general usefulness of science fiction questions in science, I have to admit that the day I spent in Freiburg was very inspiring, as I learned so many new things. Maybe, in the end, this is even the more important aspect of the HUDOC project: bringing together people from different disciplines and having them talk with each other...


New Post-Doc in CALC (Post from 02.08.2017 by J.-M. List)


It is my pleasure to welcome Yunfan Lai as a post-doc in the CALC project. He has a lot of experience working with Sino-Tibetan languages and devoted his PhD to Khroskyabs, a very interesting branch of Sino-Tibetan whose history is still not clearly understood. As a member of CALC, Yunfan will pursue his studies on Khroskyabs and related varieties, and also help to uncover the mysterious history of Sino-Tibetan.


Back from Holidays (Post from 01.08.2017 by J.-M. List)


Having traveled for about two weeks, interrupted by a talk I gave in Cologne, I am now back at work and finally find time to announce some news on what happened recently. First, there are two new blogposts I wrote, one in English on similarities in linguistics, a follow-up to a blogpost I devoted to the same topic earlier this year. The other blogpost, in German, is devoted to impoliteness (Unhöflichkeit) in Chinese and other languages. Second, there is the talk I gave together with Nathan W. Hill in Cologne, at a workshop on the regularity of sound change, organized by Eugen Hill and Robert Mailhammer. In our talk, titled "Computer-assisted approaches to linguistic reconstruction", we outlined a new framework for automated linguistic reconstruction, which we illustrated with examples from the Burmish languages.


Three new talks during a busy week (Post from 09.07.2017 by J.-M. List)


From Friday, 30th of June, until last Friday, 7th of July, I gave three talks on three different topics. It started in Paris with a summary of the potential of network approaches in Old Chinese reconstruction, after which I was very surprised that many scholars seem to support the idea of handling Chinese character formation with directed networks (and I hope that I will find time to address this soon, even if only in a small example). After that, I gave a talk in Liège on colexifications and cross-linguistic polysemies, and how we plan to update the CLICS database when we launch CLICS 2.0. Finally, I introduced some basic ideas on how to handle lexical and etymological data within the Cross-Linguistic Data Formats initiative, focusing specifically on annotation and analysis. Although it was quite exhausting to prepare all these talks, I am glad that I scheduled them for this time, since it forced me to push a couple of important projects, such as cross-linguistic colexifications and the cross-linguistic data formats, which are all central for computer-assisted language comparison in general, and also important for Sino-Tibetan in particular.

After one week in Jena, where I'll try to catch up with the work I could not finish yet, I'll finally have two weeks of holidays until the beginning of August, interrupted only by another talk in Cologne on Friday next week, in which Nathan Hill and I will present some interesting new work on Burmish languages (I'll report later in more detail).


New Blog Posts and Papers (Post from 29.06.2017 by J.-M. List)


I have published a couple of new papers recently, but since they go back to my former research project and were not directly developed as part of the CALC project, I do not list them in the list of papers. They are, however, quite important for our research, since they both deal with Old Chinese reconstruction.

The first paper is on "Using network models to analyze Old Chinese rhyme data" and will soon officially appear in the Bulletin of Chinese Linguistics. In the meantime, you can find my author's copy here.

The second paper, on "Vowel purity and rhyme evidence in Old Chinese reconstruction" (joint work with my colleagues from Paris and London, Jananan S. Pathmanathan, Eric Bapteste, Philippe Lopez, and Nathan W. Hill), finally came out today, and you can find the PDF for download here.

I also wrote two more blog posts, both devoted to language comparison in general and my view on computer-assisted language comparison in particular. The first blog post (in English) is titled "Trees do not necessarily help in linguistic reconstruction" and can be found here. The second one deals with sound change, explains it with the help of tooth loss in comic books, and can be found here.


New Papers Accepted (Post from 15.06.2017 by J.-M. List)


Two new papers have been accepted during the last two weeks, and I am very glad about both publications, since they cover topics that touch the core of my project on computer-assisted language comparison.

The first is joint work with Nathan W. Hill (SOAS, London), and titled "Challenges of annotation and analysis in computer-assisted language comparison: A case study on Burmish languages". In this paper, we point to general annotation challenges when analysing South-East Asian languages in which compounding is frequent and sound correspondences are often hard to discover. We present a new database of cognate sets across 8 Burmish languages, all coded for partial cognacy, and consistently aligned. The final version of the paper, as submitted to the Yearbook of the Poznań Linguistic Meeting, is available here.

The second paper is joint work with Gerhard Jäger (University of Tübingen), and concentrates on a problem which is often overlooked in the literature, namely the problem of how well current algorithms infer which word forms were used to express a given concept in ancestral, unattested languages. This is not a trivial problem, and we only address it from the perspective of the classical lexicostatistical word lists, where we test on three datasets (Indo-European, Austronesian, and Chinese) how well different algorithms infer the ancestral states as they are predicted by the gold standard (the proto-forms provided along with the datasets). It turns out that the algorithms do not perform very well (unfortunately, MLN, an algorithm on which I worked a lot myself, performs worst of all), but when looking at the gold standards in detail, we realized that many of the errors are due to problems with the gold standards themselves, which are quite inconsistent and not very trustworthy. As a result, we think that using ancestral state reconstruction methods for this purpose of "onomasiological reconstruction" might actually help to get a better estimate. The draft of the paper can be found here.


Final Report of SinDial Project (Post from 19.05.2017 by J.-M. List)


My DFG-funded research project on Vertical and lateral aspects of Chinese dialect history officially ended on December 31, 2016. From January 2015 until December 2016 I had two very interesting but also challenging years, during which I became acquainted with many scholars from different disciplines and countries, but also with many new approaches and methods in historical linguistics and related disciplines.

Having submitted my final report in April (the first time in a long while that I wrote in German again), and hoping that the reviewers do not have anything grave to complain about, I have now published the report with Zenodo, and you can find it online here.

In case you wonder why I recommend this final report in the context of the CALC project, the answer is simple: Many of the ideas that I put into the project application for the CALC project were developed while I was in Paris, funded by the DFG, so in some sense, the SinDial project on Chinese dialects was the root of CALC.


LingPy-Tutorial at the Quantitative Methods Spring School (Post from 15.05.2017 by J.-M. List)


Last week, we had a spring school on Quantitative Methods here in Jena. This is an annual event, and it was the second time that it took place, with Fiona Jordan organizing the main event, and many interesting scientists coming here as tutors or students for one full week (seven days), which was quite exhausting but also very interesting.
This time, I gave a tutorial on LingPy, introducing the basic ideas of automatic sequence comparison and how it can be used to get started on computer-assisted workflows. You will find the tutorial online here in the form of an IPython Notebook, but you can likewise download the PDF or follow my introductory slides. All in all, this tutorial will provide you with the most recent information needed to start making your own analyses with LingPy.
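
To give a rough idea of what automatic sequence comparison means, here is a minimal sketch of global pairwise alignment (the classic Needleman-Wunsch procedure) for two words represented as lists of sound segments. It is written in plain Python and does not use LingPy or reproduce its API; the function name `align`, the scoring values, and the example segmentations are illustrative assumptions only.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists (Needleman-Wunsch sketch).

    The scores are placeholder assumptions, not values from any tool.
    Returns both aligned sequences (gaps as "-") and the total score.
    """
    n, m = len(seq_a), len(seq_b)
    # score[i][j]: best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair the two segments
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )
    # Trace back from the bottom-right cell to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


# Hypothetical segmentations of German "Tochter" and English "daughter".
german = ["t", "ɔ", "x", "t", "ə", "r"]
english = ["d", "ɔ", "t", "ə", "r"]
row_a, row_b, total = align(german, english)
print("\t".join(row_a))
print("\t".join(row_b))
```

Real tools for historical linguistics replace the simple match/mismatch scores with scoring schemes based on sound classes and handle many further details, so this sketch only illustrates the general principle.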


Mini-Workshop on Poetry (Post from 22.04.2017 by J.-M. List)


On Thursday last week, we had a mini-workshop on poetry for which we invited colleagues from the Max Planck Institute for Empirical Aesthetics and from the University of Zurich. It may look strange at first sight why poetry would matter for computer-assisted language comparison, but the poetic tradition of rhyming in the history of Chinese in fact plays a crucial role for the reconstruction of the oldest stages of the language. I myself devoted two recent studies to the application of network approaches to study Old Chinese phonology, which are currently in the final phase of editing and will hopefully appear soon (the draft for one study can be found here). In my talk, I presented this research briefly (the slides are here), and pointed to future questions on the dynamics underlying the development of poetic traditions from a cross-linguistic and historical perspective.

The other speakers discussed many interesting topics, ranging from empirical studies on poetry and how one can annotate the important factors that constitute poetic speech (Winfried Menninghaus and Christine Knoop, MPI-AE), via the automatic detection of rhyme patterns in German poetry (Thomas Haider, MPI-AE), up to questions of language contact and cultural exchange (Paul Widmer, UZH), and the co-evolution of linguistic and poetic forms (Cormac Anderson, MPI-SHH). Our discussions during the talks were long, and since we had to stop at some point, there was no time for the talk by Olivier Morin (MPI-SHH) on "poetry as super-week communication". This was a definite loss, as I saw when Olivier shared his slides afterwards, but luckily we are working in the same department, and nothing will prevent us from continuing our discussions and exchange of ideas.

We all decided to stay in close contact and keep each other informed on future ideas as well as concrete research, and it is quite likely that at some point in the not-so-distant future, I will present more of this here.


Mini-Workshop on Sino-Tibetan Phylogenies in Zürich (Post from 12.04.2017 by J.-M. List)


We had an interesting small workshop in Zürich, attended by my former colleagues from
Paris, Laurent Sagart, Guillaume Jacques, and Yunfan Lai, with whom I pursue
the goal of establishing a larger lexicostatistic database of Sino-Tibetan
languages, as well as by people from Balthasar Bickel's team. We
presented the work we have done so far, and I myself gave a talk on
my ideas regarding a Sino-Tibetan Lexicostatistic Database.
We will all keep collaborating in the future and potentially organize a second
meeting, either in Paris or in Jena, later this year.


Attending the EACL Conference in Valencia (Post from 02.04.2017 by J.-M. List)


Next week, I will attend the conference of the European Chapter of the Association for Computational Linguistics.
After Lyon in 2012, this is my second EACL, and I will be involved in two presentations, one together with Gerhard Jäger and Pavel Sofroniev on automatic cognate detection, and one where I present the current state of my EDICTOR tool for computer-assisted language comparison.


Project Website Online (Post from 31.03.2017 by J.-M. List)


In time for the official start of the project, we are glad to announce that the official project website is now online.
There is no question that this website will be refined over the course of the project, but the basic infrastructure is now there, and those interested in our project will be able to follow our news.

Resources Developed in CALC

Apart from analyses presented in the form of papers and talks, CALC will produce various kinds of
resources which help colleagues to pursue computer-assisted research themselves. Please don't
hesitate to contact us whenever you have questions regarding the resources which you will find on
this site. In most cases, you will find more detailed information when following the links to the respective
project pages.

List, J.-M. (2017): A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations. 9-12.

Presentations on CALC

Talks from 2017

Lai, Y. and J.-M. List (2017): The Sino-Tibetan language family. What we know, what we can know, and what we know we cannot know. Paper, presented at the workshop "Languages as Keys to Our Past", organized as part of the DOT 2017 (2017/09/21, Jena).

List, J.-M. (2017): Languages as keys to our past. How classical and computational approaches to language comparison help us to shed light on the past of our languages. Paper, presented at the workshop "Languages as Keys to Our Past", organized as part of the DOT 2017 (2017/09/21, Jena).

List, J.-M. (2017): Computer-assisted language comparison. Reconciling classical and computational approaches to historical linguistics. Talk, held at the "Institute for Oriental and Classical Studies" (2017/03/21, Moscow, Russian State University for the Humanities).

Dr. Johann-Mattis List (Group Leader)


In my research, I generally take a data-driven, empirical, and quantitative perspective on language change and language history, with a special focus on South-East Asian languages. In contrast to purely computational approaches, however, I try to keep my research closely connected to traditional historical linguistics and linguistic theory, following a computer-assisted rather than a computer-based framework of quantitative research in historical linguistics.