SIGMORPHON 2016 Shared Task: Morphological Reinflection

Overview

The shared task has concluded! Thanks to all those who participated. All data (including the test sets) will be hosted on this site.
Please read Cotterell et al. (2016) for a detailed analysis of
submitted systems and the results.

In 2015-2016, SIGMORPHON
is hosting a shared task on morphological reinflection. An
example of English reinflection is the conversion of ran to its
present participle, running.

To participate in the shared task, you will build a system that can
learn to solve reinflection problems. All submitted systems will be
compared on a held-out test set.

You will be invited to describe your system in a short paper for
the SIGMORPHON 2016 workshop. The task organizers will write an
overview paper that describes the task and summarizes the different
approaches taken and their results.

Shared Task Paper

Submission

We have released the test data! It is in the same format as the
training and dev data with the exception that the last column has
been omitted. Please run your system for each language and each task
for which you wish to submit an entry into the competition. The
output format should be a text file identical to the train and dev
files for the given task. Essentially, you will be adding the
missing last column of answers to the test files. Note that you may
submit multiple predictions for a given row and we will measure mean
reciprocal rank. If you do submit multiple ordered guesses, please
output multiple lines with differing last columns; the order in the
file will be the order in which we rank them.
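
As a rough illustration of the output format described above, the following Python sketch appends one or more ranked guesses as the last column of each test row. The function name format_solution_lines, the predict callable, and the two-column toy row are hypothetical; the actual column layout depends on the task.

```python
def format_solution_lines(test_rows, predict, max_guesses=1):
    """Append predicted forms as the missing last column of each test row.

    test_rows: iterable of lists of fields (the columns of one test line).
    predict: hypothetical callable returning candidate forms, best first.
    """
    lines = []
    for fields in test_rows:
        guesses = predict(fields)[:max_guesses]
        for guess in guesses:
            # One output line per guess; earlier lines are ranked higher.
            lines.append("\t".join(list(fields) + [guess]))
    return lines


# Toy usage with a made-up row and a made-up predictor.
toy_rows = [["ran", "Present participle"]]
toy_predict = lambda fields: ["running", "raning"]
toy_lines = format_solution_lines(toy_rows, toy_predict, max_guesses=2)
```

Writing the returned lines to a UTF-8 text file, one per line, would give a file in the same shape as the training and dev files.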

Email the resulting text files
to sigmorphon.sharedtask.2016@gmail.com
with the subject in the format INSTITUTION--XX--Y. Replace
INSTITUTION with the name of your institution and XX with an
integer index (to distinguish multiple systems from the same
institution). If several institutions are involved, please place a
hyphen between the names. Y should be 1, 2, or 3, depending on which
data you are using to solve the task. If there are any additional
details you would like us to know about your system or the resources
you used, please write a short description in the body of the email.
The three data categories are:

1 = Standard: The solution to task 1 may only use task 1
training/development data. Anything else is considered "using a bonus
resource". Likewise, the solution to task 2 may only use task 1 and
task 2 training/development data. Anything else is considered "using a
bonus resource". The solution to task 3 may only use task 1, 2, and
task 3 training/development data. Anything else is considered "using a
bonus resource".

3 = Bonus: tasks are solved using the Standard approach, possibly
also drawing on training data from higher-numbered tasks and/or the
extra unlabeled data provided on the website.

Please name your solution files "LANG-task#-solution", for example
"finnish-task1-solution", etc. We encourage participants to send
one email per category, with a single attached archive file
containing the solutions for all languages and tasks solved. So, if
you are solving all tasks with approach "Standard" (1), all the
solutions can be communicated with one email with all your
"LANG-task#-solution" files as an archive.

Submissions are due at 11:59 pm (anywhere in the world) on April 28, 2016 (Extended).
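
The naming conventions for the email subject and the solution files can be sketched as two small Python helpers. This is only an illustration: the function names and the institution name "ExampleU" are made up, and the index is written as a plain integer, which is an assumption since no padding is specified.

```python
def email_subject(institutions, system_index, category):
    """Build a subject of the form INSTITUTION--XX--Y.

    institutions: list of institution names, joined by single hyphens.
    system_index: integer distinguishing systems from the same institution.
    category: 1, 2, or 3, the data category used.
    """
    if category not in (1, 2, 3):
        raise ValueError("category must be 1, 2, or 3")
    return "--".join(["-".join(institutions), str(system_index), str(category)])


def solution_filename(language, task):
    """Build a file name of the form LANG-task#-solution."""
    if task not in (1, 2, 3):
        raise ValueError("task must be 1, 2, or 3")
    return f"{language}-task{task}-solution"


subject = email_subject(["ExampleU"], 1, 1)
filename = solution_filename("finnish", 1)
```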

Downloads

Inflectional Morphology

A word's form reflects syntactic and semantic features that are expressed by the word. For example, each English count noun has both singular and plural forms (robot/robots, process/processes), known as the inflected forms of the noun. A Polish verb may have nearly 100 inflected forms.

NLP systems must be able to analyze and generate all of these inflected forms. Fortunately, inflected forms tend to be systematically related to one another. This is why English speakers can usually predict the singular form from the plural and vice-versa, even for words they have never seen before.

The Tasks

There are actually three similar tasks. Your system may compete on any or all of the three tasks. Training examples and development examples will be provided for each task.
For each language, the possible inflections are named by a given finite set of morphological tags.

Task 1 – Inflection

Given a lemma (the dictionary form of a word) with its part-of-speech, generate a target inflected form.

Task 2 – Reinflection

Given an inflected form and its current tag, generate a target inflected form.

English example

Source tag: Past
Source form: ran
Target tag: Present participle

Output: running

Task 2 is a harder case of Task 1, since the source tag is no longer guaranteed to be Lemma.

Task 3 – Unlabeled Reinflection

Given an inflected form without its current tag, generate a target inflected form.

English example

Source tag: not given
Source form: ran
Target tag: Present participle

Output: running

Task 3 is a harder case of Task 2, since the source tag is no longer provided.

When solving a task, participants may use training data for lower-numbered tasks without it being considered a bonus resource. That is, when solving task 2, using task 1 data is permitted. Likewise, when solving task 3, both task 2 and task 1 training data can be used. We encourage participants, if possible, to run several systems and to report which training data they used for tasks 2 and 3. Knowing how well task 2 (or 3) can be solved using only task 2 (or 3) data, as opposed to also using data from lower-numbered tasks, is valuable extra information.

Some Possible Strategies

All of these are sequence-to-sequence mapping problems. If you have a general supervised method for learning such mappings, you can simply throw it at all of these tasks.

Alternatively, you can solve the tasks in sequence. For instance, reduce Task 2 to Task 1 by recovering an inflected form's lemma given its tag, and then reduce Task 3 to Task 2 by recovering an inflected form's tag.
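
The chain of reductions can be sketched in Python as a composition of three components. All names here (lemmatize, inflect, guess_tag) are hypothetical placeholders for learned models, and the lookup tables are toy stand-ins built from the English example on this page.

```python
def make_task2_solver(lemmatize, inflect):
    """Reduce Task 2 to Task 1: recover the lemma, then inflect it."""
    def solve(source_tag, source_form, target_tag):
        lemma = lemmatize(source_form, source_tag)
        return inflect(lemma, target_tag)
    return solve


def make_task3_solver(guess_tag, task2_solver):
    """Reduce Task 3 to Task 2: recover the source tag first."""
    def solve(source_form, target_tag):
        source_tag = guess_tag(source_form)
        return task2_solver(source_tag, source_form, target_tag)
    return solve


# Toy stand-ins for learned components (lookup tables, not real models).
toy_lemmas = {("ran", "Past"): "run"}
toy_inflections = {("run", "Present participle"): "running"}

task2 = make_task2_solver(
    lambda form, tag: toy_lemmas[(form, tag)],
    lambda lemma, tag: toy_inflections[(lemma, tag)],
)
task3 = make_task3_solver(lambda form: "Past", task2)
```

A pipeline like this propagates errors from each stage to the next, which is one reason a joint approach may be preferable.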

An inflectional paradigm is a table that lists all inflected forms for some lemma. Rather than treating the training examples as independent, you could assemble them into partial paradigms based on shared input or output forms. You could then jointly analyze the partial paradigms to better discover latent structure in the observed forms and to better extrapolate to unobserved forms (Dreyer and Eisner 2009, 2011; Durrett and DeNero 2013; Hulden et al. 2014; Nicolai et al. 2015).
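
One simple way to assemble partial paradigms is to link the source and target form of every training example and merge examples that share a form. The Python sketch below does this with a small union-find structure. It assumes Task 2 style examples given as (source tag, source form, target tag, target form) tuples, and it will wrongly merge paradigms of different lemmas that happen to share a spelling.

```python
from collections import defaultdict


def assemble_partial_paradigms(examples):
    """Group (src_tag, src_form, tgt_tag, tgt_form) examples into partial paradigms."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Forms that co-occur in one example belong to the same paradigm.
    for _, src_form, _, tgt_form in examples:
        union(src_form, tgt_form)

    # Collect tag -> set of forms for each connected group of forms.
    paradigms = defaultdict(dict)
    for src_tag, src_form, tgt_tag, tgt_form in examples:
        table = paradigms[find(src_form)]
        table.setdefault(src_tag, set()).add(src_form)
        table.setdefault(tgt_tag, set()).add(tgt_form)
    return [dict(table) for table in paradigms.values()]


toy_examples = [
    ("Past", "ran", "Present participle", "running"),
    ("Present participle", "running", "3rd singular present", "runs"),
    ("Past", "walked", "Lemma", "walk"),
]
toy_paradigms = assemble_partial_paradigms(toy_examples)
```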

A Baseline System

We provide a baseline system that can be used as a starting point for experiments, or simply for comparison. The system implements discriminative string transduction, similar in spirit to other recent approaches such as Durrett and DeNero (2013) and Nicolai et al. (2015). The implementation and a description are available here.

Bonus Resources

When evaluated on a given (task, language) pair, your system is permitted to consult the provided training data for that pair. Your system is also permitted to consult the following additional resources, but no other resources. Participants need to clearly indicate if they are using the unlabeled corpora in their approach. We want to separate participation into two categories: those that only use the example inflection data, and those that take advantage of unlabeled data as well.

Resources that we will provide for each language:

A large untagged monolingual text corpus.

The list of possible tags.

A description of each tag as a set of attribute=value pairs, e.g., [Person=3,Number=plural,Tense=Past]. These features are useful for generalization in languages with large tagsets.

The test inputs for all tasks in the same language (transductive learning). The Task 2 example shown above reveals that ran can be a Past form. This could help Task 3 predict the Past form of runs.
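
The attribute=value tag descriptions listed above lend themselves to a simple feature representation. The Python sketch below parses a tag written in the bracketed style of the example into a dictionary and finds the features two tags share. The function names are hypothetical, and the exact tag syntax in the released files may differ from the bracketed example.

```python
def parse_tag(tag):
    """Parse '[Person=3,Number=plural,Tense=Past]' into a feature dictionary."""
    body = tag.strip().strip("[]")
    features = {}
    for pair in body.split(","):
        if not pair.strip():
            continue  # tolerate empty tags and stray commas
        attribute, _, value = pair.partition("=")
        features[attribute.strip()] = value.strip()
    return features


def shared_features(tag_a, tag_b):
    """Return the attribute=value pairs that two tags have in common."""
    a, b = parse_tag(tag_a), parse_tag(tag_b)
    return {k: v for k, v in a.items() if b.get(k) == v}


parsed = parse_tag("[Person=3,Number=plural,Tense=Past]")
common = shared_features(
    "[Person=3,Number=plural,Tense=Past]",
    "[Person=1,Number=plural,Tense=Past]",
)
```

Decomposing tags this way lets a model share statistics between tags that overlap in some features, rather than treating each tag as an unrelated symbol.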

Note that, as described above, using lower-numbered task training data is not considered a bonus resource. Task 2 may use task 1 data, and task 3 may use task 1 and 2 training data.

It is not required to use these bonus resources. They are permitted in order to make the task more realistic, to allow more freedom to develop interesting approaches, and because it would be difficult to exclude their use.

We encourage participants to experiment with various approaches and to document clearly which training data and bonus resources were used.

Evaluation

Your system should predict a single string for each test example.
Optionally, you may also produce a ranked list of up to 20 predictions for each test
example.

We will distribute an evaluation script for your use on the development data. The script will report:

The script will also provide some analysis of errors, e.g., according to whether the correct output appears in the monolingual corpus.

You are encouraged to do ablation studies to measure the advantage
that you gained from using bonus resources or from particular
innovations. You should perform these studies on the development data
and report the findings in your paper.

We will use the same script to evaluate your system's output on the
test data. If multiple answers are correct, we will use the answer
that gives you the higher score. For example, in Task 1, the two
senses of English lemma hang have
different Past forms, hung
and hanged. In Task 3, the English verb lay
could be a Present or Past form,
of different verbs whose Past participle forms are respectively laid and lain.
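
To make the scoring of ranked guesses and multiple correct answers concrete, here is a minimal Python sketch. It is not the official evaluation script: it assumes each test example comes with a set of acceptable gold forms, computes mean reciprocal rank over at most 20 guesses, and also reports top-1 accuracy, which is an assumed metric.

```python
def reciprocal_rank(guesses, gold_forms, max_guesses=20):
    """Return 1/rank of the first guess that matches any acceptable gold form."""
    for rank, guess in enumerate(guesses[:max_guesses], start=1):
        if guess in gold_forms:
            return 1.0 / rank
    return 0.0


def evaluate(predictions, gold):
    """predictions: list of ranked guess lists; gold: list of sets of correct forms."""
    total = len(gold)
    correct = sum(1 for g, answers in zip(predictions, gold) if g and g[0] in answers)
    rr_sum = sum(reciprocal_rank(g, answers) for g, answers in zip(predictions, gold))
    return {"accuracy": correct / total, "mrr": rr_sum / total}


# Toy data: any form in the gold set counts as correct.
toy_predictions = [["hung"], ["lain", "laid"], ["runned", "running"]]
toy_gold = [{"hung", "hanged"}, {"laid"}, {"running"}]
toy_scores = evaluate(toy_predictions, toy_gold)
```

Scoring against a set of gold forms mirrors the rule that the answer giving the higher score is used when several answers are correct.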

We will evaluate on each language separately. An aggregate evaluation
will weight all languages equally, including the 2 surprise
languages.

In the overview paper, we will also compare the systems to one
another. We will evaluate

which systems are significantly different in performance

which examples were hard and which types of systems succeeded on them

which systems would provide complementary benefit in an ensemble system

The Languages

We have chosen a diverse set of 10 languages, mostly languages with rich inflection.
All of the datasets have been scraped from Wiktionary and undergone additional
processing at the Center for Language and Speech Processing at Johns Hopkins University. The data are formatted according to the schema described in Sylak-Glassman et al. (2015).

For all languages, the data consist of orthographic strings (written spellings), not phonological strings (pronunciations).

Spanish: Spanish is a Romance language of the Indo-European family that originated in the Castile region of Spain. More than 400 million people speak Spanish as a native language, making it second only to Mandarin in terms of its number of native speakers worldwide.

German: German is a West Germanic language that derives most of its vocabulary from the Germanic branch of the Indo-European language family. It is spoken by 90 million people in central Europe.

Finnish: Finnish is the language spoken by the majority of the population in Finland. It has 5.4 million speakers.

Russian: Russian is an East Slavic language and an official language in Russia, Belarus, Kazakhstan, and Kyrgyzstan. It is an unofficial but widely spoken language in Ukraine, Moldova, Latvia and Estonia. Russian belongs to the family of Indo-European languages and is one of the three living members of the East Slavic languages. It has 150 million native speakers and 260 million speakers in total.

Turkish: Turkish is the most widely spoken of the Turkic languages, with around 63 million native speakers. Speakers are located predominantly in Turkey, with smaller groups in Germany, Bulgaria, Macedonia, Northern Cyprus, Greece, the Caucasus, and other parts of Europe and Central Asia.

Georgian: Georgian is a Kartvelian language spoken by Georgians. It is the official language of Georgia and spoken by 4.3
million people.

Navajo: Navajo is a language of the Athabaskan branch of the Na-Dené family, by which it is related to languages spoken across the western areas of North America. Navajo is spoken primarily in the Southwestern United States, especially in the Navajo Nation political area. It is one of the most widely spoken Native American languages with almost 170,000 Americans speaking Navajo at home as of 2011.

Arabic: Modern Standard Arabic (MSA) is the literary standard across the Middle East, North Africa, and the Horn of Africa, and is one of the six official languages of the United Nations. It is spoken by over 200 million people.

Maltese (Surprise Language 2): Maltese is a Semitic language closely related to Arabic. It is spoken natively on the island of Malta by 520,000 people.

Data format

The training and development data are provided in a simple UTF-8 encoded text format where each line in a file is an example that consists of word forms and corresponding morphosyntactic descriptions (MSDs) provided as a set of feature/value pairs. The fields on a line are TAB-separated.
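
A file in this format can be read with a few lines of Python. The sketch below is only an illustration: the function name read_examples is hypothetical, it makes no assumption about how many fields a line has or what each column means, and the sample line is made up.

```python
def read_examples(lines):
    """Split each non-empty line into its TAB-separated fields."""
    examples = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        examples.append(line.split("\t"))
    return examples


# Made-up sample line, for illustration only.
sample_lines = ["run\tTense=Past\tran\n", "\n"]
sample_examples = read_examples(sample_lines)
```

For a real file, the function can be given an open file object, for example open(path, encoding="utf-8"), where path is the location of a downloaded data file.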