Article Structure

Abstract

Introduction

Cognates are words in different languages having the same etymology and a common ancestor.

Related Work

There are three important aspects widely investigated in the task of cognate identification: semantic, phonetic and orthographic similarity.

Our Approach

Although there are multiple aspects that are relevant in the study of language relatedness, such

Experiments

4.1 Data

Conclusions and Future Work

In this paper we proposed a method for automatic detection of cognates based on orthographic alignment.

Topics

edit distance

In Automatic Detection of Cognates Using Orthographic Alignment

(2013) proposed a method for cognate production relying on statistical character-based machine translation, learning orthographic production patterns, and Mulloni (2007) introduced an algorithm for cognate production based on edit distance alignment and the identification of orthographic cues when words enter a new language.

Page 2, “Related Work”

Therefore, because the edit distance was widely used in this research area and produced good results, we are encouraged to employ orthographic alignment for identifying pairs of cognates, not only to compute similarity scores, as was previously done, but to use aligned subsequences as features for machine learning algorithms.

Page 2, “Our Approach”

We employ several orthographic metrics widely used in this research area: the edit distance (Levenshtein, 1965), the longest common subsequence ratio (Melamed, 1995) and the XDice metric (Brew and McKelvie, 1996).

Page 4, “Experiments”
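As an illustration of one of the metrics named above, the following is a minimal sketch of the longest common subsequence ratio. It assumes the usual definition (length of the longest common subsequence divided by the length of the longer word); the function names `lcs_length` and `lcsr` are hypothetical, and the value returned for two empty strings is an assumption made here, not something the excerpt specifies.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise skip one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return lcs_length(a, b) / longest
```

For example, `lcsr("colour", "color")` divides an LCS of length 5 by the longer length 6.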

In addition, we use SpSim (Gomes and Lopes, 2011), which outperformed the longest common subsequence ratio and a similarity measure based on the edit distance in previous experiments.

Page 4, “Experiments”

For the edit distance, we subtract the normalized value from 1 in order to obtain similarity.
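The conversion from distance to similarity can be sketched as follows. The excerpt does not say how the distance is normalized, so this sketch assumes division by the length of the longer word; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """1 minus the edit distance normalized by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - edit_distance(a, b) / longest
```

Under this assumed normalization the score lies in [0, 1], with 1 for identical words and 0 for words that share no aligned characters.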

learning algorithms

We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates.

Page 1, “Abstract”

Therefore, because the edit distance was widely used in this research area and produced good results, we are encouraged to employ orthographic alignment for identifying pairs of cognates, not only to compute similarity scores, as was previously done, but to use aligned subsequences as features for machine learning algorithms.

Page 2, “Our Approach”

3.3 Learning Algorithms

Page 3, “Our Approach”

and Waterman, 1981), and other learning algorithms for discriminating between cognates and non-cognates.

n-grams

i) n-grams around gaps, i.e., we account only for insertions and deletions;

Page 3, “Our Approach”

ii) n-grams around any type of mismatch, i.e., we account for all three types of mismatches.

Page 3, “Our Approach”
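The two feature variants above can be sketched as follows, assuming the alignment is already given as two equal-length strings in which `-` marks a gap. The `source>target` feature format, the rule that a window is kept when it contains at least one relevant mismatch, and the name `mismatch_ngrams` are all assumptions made for illustration; the excerpts do not specify them.

```python
GAP = "-"  # assumed gap symbol in the aligned strings


def mismatch_ngrams(src_aligned: str, tgt_aligned: str, n: int,
                    gaps_only: bool = False) -> list[str]:
    """Collect aligned n-gram pairs whose window contains a relevant mismatch.

    gaps_only=True  -> variant (i): only insertions and deletions count.
    gaps_only=False -> variant (ii): substitutions count as well.
    """
    if len(src_aligned) != len(tgt_aligned):
        raise ValueError("aligned strings must have equal length")

    def is_relevant(s: str, t: str) -> bool:
        if s == t:
            return False
        if gaps_only:
            return s == GAP or t == GAP
        return True

    features = []
    for start in range(len(src_aligned) - n + 1):
        s_win = src_aligned[start:start + n]
        t_win = tgt_aligned[start:start + n]
        if any(is_relevant(s, t) for s, t in zip(s_win, t_win)):
            features.append(f"{s_win}>{t_win}")
    return features
```

With the hypothetical alignment `exhaustiv-` / `ex-austivo` and `n=1`, both variants yield the deletion `h>-` and the insertion `->o`, since the pair contains no substitution. A pair such as `abc` / `axc` produces features only under variant (ii).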

We achieve slight improvements by combining n-grams, i.e., for a given n, we use all i-grams, where i ∈ {1, ..., n}. In order to provide information regarding the position of the features, we mark the beginning and the end of the word with a $ symbol.
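A minimal sketch of the boundary marking and the combination of i-grams follows. For simplicity it operates on a single word rather than on an aligned pair, which is an assumption of this example; the function name `combined_ngrams` is hypothetical.

```python
def combined_ngrams(word: str, n: int) -> list[str]:
    """Return all character i-grams for i = 1..n over the $-delimited word."""
    padded = f"${word}$"  # mark the beginning and the end of the word
    grams = []
    for i in range(1, n + 1):
        grams.extend(padded[k:k + i] for k in range(len(padded) - i + 1))
    return grams


print(combined_ngrams("ab", 2))
# ['$', 'a', 'b', '$', '$a', 'ab', 'b$']
```

The `$` markers let a learner distinguish a feature such as `$a` (word-initial) from `a` in any other position.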