Project Members

Michael Cysouw

Hagen Jung

Quantitative approaches to lexical comparison

Lexical material ("words") is an important source of information for establishing genealogical relations between languages. We investigated quantitative methods to assist linguists in this kind of historical-comparative research.

The comparison of words (i.e. strings of characters) is strongly reminiscent of the comparison of strings of DNA. The major difference is that strings of DNA are normally much longer than words in human language, so in principle they contain more information for assessing their similarity. In contrast, each character of a word is much more informative than a 'letter' of DNA: DNA has only four letters (A, C, G, T), while a human language has between 15 and about 100 different 'letters' (phonemes).
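The contrast in alphabet size can be made concrete with a small calculation. The sketch below is only an illustration, not part of the project's methods: it assumes that all symbols are equally likely and independent, which gives an upper bound on the information per symbol (real DNA and real phoneme inventories are not uniform). The function name and the five-phoneme word length are hypothetical choices made for the example.

```python
import math

def max_bits_per_symbol(alphabet_size: int) -> float:
    """Upper bound on information per symbol, assuming a uniform alphabet."""
    return math.log2(alphabet_size)

dna_bits = max_bits_per_symbol(4)          # four DNA letters: 2 bits each
small_inventory = max_bits_per_symbol(15)  # about 3.9 bits per phoneme
large_inventory = max_bits_per_symbol(100) # about 6.6 bits per phoneme

# Hypothetical example: how many DNA letters carry as much information
# (under the uniform assumption) as a word of five phonemes?
word_length = 5
dna_equivalent_small = word_length * small_inventory / dna_bits
dna_equivalent_large = word_length * large_inventory / dna_bits

print(f"DNA: {dna_bits:.2f} bits per letter")
print(f"Phonemes: {small_inventory:.2f} to {large_inventory:.2f} bits per symbol")
print(f"A {word_length}-phoneme word matches roughly "
      f"{dna_equivalent_small:.1f} to {dna_equivalent_large:.1f} DNA letters")
```

Under these assumptions a phoneme carries roughly two to three times as much information as a DNA letter, which is far from enough to offset the much greater length of typical DNA strings; the calculation only quantifies the per-symbol side of the comparison.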

We expected these two kinds of data to be roughly equally informative, and consequently we adapted approaches from DNA comparison to the comparison of words in human language. As data, we used the various wordlists collected in our department to investigate and test different quantitative methods of lexical comparison.
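As an illustration of what adapting a DNA comparison method to words can look like, the sketch below applies Needleman-Wunsch global alignment, a standard technique from sequence comparison, to two orthographic strings. It is not the method used in the project: the function name, the scoring values (match +1, mismatch -1, gap -1) and the example word pair are assumptions chosen for the example.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Needleman-Wunsch score of the best global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix with the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align two characters
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]

# English "night" and German "Nacht", lowercased: three matching
# characters (n, h, t) and two mismatches (i/a, g/c).
print(global_alignment_score("night", "nacht"))  # 1
```

A uniform match/mismatch score treats every pair of different characters as equally dissimilar. For linguistic data one would normally want substitution scores that reflect how plausible a given sound correspondence is, much as substitution matrices are used in biological sequence alignment.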

Selected recent publications produced within this project

Cysouw, Michael & Hagen Jung. 2007. Cognate identification and alignment using practical orthographies. Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, 109-116.