Natural Language Generation for Aboriginal Languages

Aims and Background

In this project we aim to develop models and a system to generate natural language texts in the Aboriginal
language Arrernte from data in the domain of Australian Rules Football. This will provide a
framework for investigating fundamental theoretical issues in formalisms for natural language representation,
and for developing some of the first language technology applications for the greatly underexplored indigenous languages of Australia. We discuss these issues below, followed by reasons for
specific choices of each aspect of the project.

The great majority of work on understanding and manipulating languages has been carried out on languages
that are configurational—that is, languages that have a fairly rigid word order—such as English,
French, Spanish, Chinese and so on. Consequently, most formalisms for representing language, and applications that manipulate language, can be adapted to non-configurational languages only with difficulty. Beyond its pure scientific interest, a better understanding of such languages is potentially useful for at least two reasons. First, a number of major languages, such as German and Russian, have non-configurational aspects (free word order, null anaphora, or syntactically discontinuous expressions); any applications involving them, such as Machine Translation (MT),
need to capture the differences from configurational languages.
need to capture the differences from configurational languages. Second, investigating a broader range
of languages with interesting characteristics can say something about what representations are necessary
and sufficient to describe language in general. There is a strand of linguistics and computational
linguistics—including Tree Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), and
other mildly context-sensitive grammar formalisms, as well as Generalized Phrase Structure Grammar
(GPSG) and Chomskyan linguistics pre-1973—which aims (among other things) to design a formalism
which has the minimum expressive power necessary for describing human language: the goal is
to minimise the need for (arbitrary) stipulations and potentially to allow more efficient algorithms than
formalisms with unrestricted computational power [Joshi et al., 1991]; and in parallel with this, to
provide insight into the human language processing mechanism [Rambow and Joshi, 1994]. In this
search for a suitable formalism, it was work on Swiss German and Bambara that demonstrated that natural languages require a more expressive computational representation than the widely used context-free grammars. Relatedly, it was recent work on German’s non-configurational properties that uncovered
some unexpected differences between formalisms previously believed to be equivalent [Hockenmaier
and Young, 2008]. Even for formalisms with unrestricted computational power such as Head-driven
Phrase Structure Grammar (HPSG), these non-configurational languages present a major representational
challenge.
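
The expressiveness argument above can be made concrete with the abstract language {aᵐbⁿcᵐdⁿ}, the standard formal model of Swiss German cross-serial dependencies: a context-free grammar can enforce either of the two matchings, but not both at once, while mildly context-sensitive formalisms such as TAG and CCG can. The following is purely an illustrative sketch (not part of the project itself) of a membership test for that language:

```python
import re

def in_cross_serial_language(s: str) -> bool:
    """Membership test for {a^m b^n c^m d^n : m, n >= 1}, a formal model of
    cross-serial dependencies such as the Swiss German verb cluster: the a's
    depend on the c's and the b's on the d's, and the two dependency sets
    cross each other -- beyond context-free power, but within the reach of
    mildly context-sensitive formalisms."""
    m = re.fullmatch(r"(a+)(b+)(c+)(d+)", s)
    return bool(m) and len(m.group(1)) == len(m.group(3)) \
                   and len(m.group(2)) == len(m.group(4))

print(in_cross_serial_language("aabccd"))  # True  (m=2, n=1)
print(in_cross_serial_language("aabbcd"))  # False (two a's but only one c)
```

The counting needed here is exactly what a context-free grammar cannot do for two crossing pairs simultaneously, which is why such data motivated the move to mildly context-sensitive formalisms.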

Australia is rich in non-configurational languages, and its Aboriginal languages have attracted interest
for decades, for a number of reasons. They have represented a significant new frontier of languages
previously unknown to the rest of the world; how they are related even to each other, much less to
other language groupings, is still very much an open question, unlike the case for Indo-European languages;
and in particular, they are different in many ways from other classes of languages, in terms of
phonology, morphology, syntax, and so on. As a result, linguistic investigation of Aboriginal languages
has been quite broad, covering languages from both Pama-Nyungan and non-Pama-Nyungan language
families: Warlpiri, Arrernte, Pitjantjatjara, and others. These three are among the most widely spoken Aboriginal languages, each estimated to have between 1,500 and 6,000 native speakers. The communities of speakers
of these languages are engaged in measures to preserve the languages, such as through bilingual education
at schools [Hartman and Henderson, 1994]. To date, there has been at best a moderate amount
of work on a few of the languages in computational linguistics, either in analysis of the languages or in
development of applications. Reasons for adding a computational aspect to pure linguistic analysis are
twofold. First, computational linguistics can use the tools of computer science to verify the consistency
of the analyses of linguists when these are scaled up to a large proportion of a language, much as model
checking and theorem provers allow logicians to test their formalisms, as argued by Bender [2008b];
this has been the experience in building large-scale computational grammars for English, for example,
for the TAG formalism (in the XTAG project [XTAG Research Group, 2001]) and for the HPSG formalism
(in the LinGO ERG project [Bender et al., 2002]). In doing so, issues are raised concerning
what algorithms are most useful for processing these grammars; and the characteristics of unusual languages
provide a challenge to existing algorithms that can inform language processing more generally.
Second, the development of applications puts linguistic analyses to use in a way that allows them to be
evaluated by a broader range of users.

For other indigenous languages around the world, there have been some recent attempts to extend purely linguistic study and language preservation efforts into computational linguistics. The most extensive programme is for Inuktitut, by the National Research Council Institute
for Information Technology of Canada, which aims to develop information retrieval and other applications
for the First Nation peoples of Canada [Johnson and Martin, 2003]. Another is with Māori at
the University of Otago, where machine translation (MT) and human-computer dialogue applications
are being developed [Knott et al., 2003]. Both projects pursue questions of scientific interest related to the specific languages while also encouraging language maintenance and preservation. However, there is no similar project for any Aboriginal language, nor even much connection with Information Technology (IT) generally; witness the 2009 Puliima workshop on bringing together Aboriginal languages and IT, only the second such event ever held.

Reasons for the choice of each aspect of the project are as follows:

language: Arrernte Arrernte is divided into Western and Eastern/Central; this project will focus on
the latter, for reasons of resources and location. Arrernte is a good choice for a language of interest
because of existing work on the language, and because of its unusual characteristics. Morphosyntactic
analyses have been proposed that describe these: extensive use of morphology, fundamentally free
word order (but with word order preferences and restrictions on various subparts of the language), lack
of a copula verb, ‘quasi-inflections’ on verbs including a ‘category of associated motion’, and so on. It
is sufficiently well documented, and with sufficient existing resources, that a computational treatment
is feasible; but a system based on such a treatment is a big challenge. In terms of linguistic analysis
of Eastern Arrernte, there is good coverage of the grammar [Strehlow, 1944, Wilkins, 1989, Green,
1994, Henderson, 1998]. Arrernte also has a well-established mechanism for word-building, including
incorporation of loan words from English to supplement any lack of vocabulary in the core language
[Green, 1994, Henderson, 2002], making discourses on non-traditional topics feasible. There is also an
electronic dictionary available for Eastern/Central Arrernte [Henderson and Dobson, 1994]. Further, it
is one of the major Aboriginal languages in Australia, one of the few where children are still learning to
speak the language as their first language. Eastern/Central Arrernte has a good deal of cultural support,
for example through bilingual teaching in the Northern Territory — where, among other things, it is taught as a compulsory language at primary schools — and as the first language of an estimated 25% of the
population of Alice Springs, the urban center for much of the remote Northern Territory.
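
The combination of free word order and rich case morphology described above is what makes Arrernte a representational challenge. As an illustrative sketch only — the "words" below are English glosses with hypothetical case and tense tags, not actual Arrernte forms — the point is that when grammatical roles are marked morphologically (e.g. by an ergative suffix on the agent), the constituents can in principle be linearised in any order without losing the predicate-argument structure:

```python
from itertools import permutations

def linearisations(clause):
    """Enumerate every surface order of a verb plus its case-marked
    arguments.  Because each argument carries its role on its suffix,
    all orders express the same underlying clause."""
    marked = [f"{word}-{case}" for word, case in clause["args"]]
    marked.append(clause["verb"])
    return [" ".join(p) for p in permutations(marked)]

# Hypothetical glossed clause, standing in for real Arrernte data.
clause = {"verb": "see-PAST",
          "args": [("dog", "ERG"), ("kangaroo", "ABS")]}
for surface in linearisations(clause):
    print(surface)
# 3 constituents -> 3! = 6 surface orders, all with the same meaning
```

A generator for a non-configurational language must choose among such orders using information structure and discourse context, rather than a fixed phrase-structure template; this sketch only enumerates the space of choices.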

application: Natural Language Generation (NLG) We are interested in output of a high quality for
a language that has few computational resources. Unrestricted MT, in spite of significant improvements
in the past decade, still produces quite poor output; in addition, the most successful current systems are
statistical, and Arrernte has nowhere near enough text for training such systems for high quality output.
Information retrieval applications such as for Inuktitut rely on texts for searching—this is reasonable
for Inuktitut, where for example bilingual English–Inuktitut parliamentary proceedings are mandated
by the Legislative Assembly of Nunavut, but Arrernte has many fewer texts available. NLG, unlike
other applications, does not require complete coverage of a language (which from the experience of the
XTAG project is very difficult to achieve), only coverage for the required domain and set of linguistic
constructions to be used, which can be scaled to whatever is feasible. Also, NLG systems which
generate text from numerical and historical data are well established and quite successful: two current ones are BabyTalk [Portet et al., 2007], where, starting from data on heart rate, blood pressure, O2 and CO2 levels in the blood, respiration rate, etc., the system produces a text-based description of the sort that a nurse might read; and SumTime [Reiter et al., 2005], where numerical data such as is found in weather predictions is translated into the sort of short text you might read in the newspaper.
Note also that here we are interested in generating entire articles, not just sentences, necessitating an
understanding of information structure, discourse and narrative as well as syntax. This is an interesting
yet underexplored area for computational treatment of indigenous languages.

domain: Australian Rules (AFL) football Language technology applications are generally more
successful in limited domains. In particular, as noted above, generating from a combination of numerical
data (such as game scores) and historical data (such as player information) is quite well established:
the technique was introduced for basketball box scores by Robin [1994] and has been extended to other
domains such as stock exchange data [Reiter and Dale, 2000] and the above-mentioned medical and
weather data. We note that there is widespread interest in AFL in Aboriginal communities. See
for example http://www.aboriginalfootball.com.au, or the discussion of the prominence of AFL
football in everyday life and in events such as the Yuendumu Games in Tatz [1987].
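
Generation from numerical and historical data of this kind typically separates content determination (picking the newsworthy facts) from surface realisation. The following minimal sketch illustrates that split in the style of systems like SumTime; the team names, data fields, threshold, and templates are all hypothetical, and the output is English, standing in for the Arrernte realiser the project would develop:

```python
def content_determination(match):
    """Select which facts about the match are worth reporting."""
    facts = [("result", match)]
    margin = abs(match["home_score"] - match["away_score"])
    if margin >= 40:  # arbitrary illustrative threshold for a one-sided game
        facts.append(("blowout", match))
    return facts

def realise(fact_type, match):
    """Map a selected fact to a surface sentence via a template."""
    home, away = match["home"], match["away"]
    hs, as_ = match["home_score"], match["away_score"]
    winner, loser = (home, away) if hs > as_ else (away, home)
    if fact_type == "result":
        return f"{winner} defeated {loser} {max(hs, as_)} to {min(hs, as_)}."
    if fact_type == "blowout":
        return "It was a one-sided contest from the opening bounce."

def generate_report(match):
    return " ".join(realise(t, m) for t, m in content_determination(match))

match = {"home": "Yuendumu", "away": "Papunya",
         "home_score": 98, "away_score": 54}
print(generate_report(match))
# Yuendumu defeated Papunya 98 to 54. It was a one-sided contest from the opening bounce.
```

For Arrernte, the realisation step is where the grammar developed in this project would replace the English templates; the content determination stage is largely language-independent.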

As a practical outcome of this system, we would like to see the generation of football articles that could
be used to help in developing literacy in Arrernte. There are existing readers used for this purpose; we
would see the articles generated by our system as supplementing these.

Overall, then, the specific aims of this project are as follows:

- To verify the consistency of existing analyses of Arrernte through a large-scale implemented grammar, and to investigate what unexpected new analyses might need to be developed based on coverage requirements; and also, complementarily, to examine how these will inform the requirements of linguistic formalisms.

- To investigate what kinds of syntax–semantics–information structure–discourse interfaces are required for end-to-end language processing of Arrernte; and to investigate what kinds of new data structures and efficient algorithms can be designed for non-configurational languages within this context.

- To investigate how the differences between configurational and non-configurational languages will affect the standard architectures for generation from numerical and historical data.

- To develop a system that can generate Arrernte-language texts that would be of interest to Arrernte speakers, and that could be used in efforts to maintain the language and promote literacy among those speakers.

Emily Bender, Dan Flickinger, and Stephan Oepen. The Grammar Matrix: An Open-Source Starter-Kit for the
Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars. In John Carroll,
Nelleke Oostdijk, and Richard Sutcliffe, editors, Proceedings of the Workshop on Grammar Engineering and
Evaluation at the 19th International Conference on Computational Linguistics, pages 8–14, Taipei, Taiwan,
2002.

Julia Hockenmaier and Peter Young. Non-local scrambling: the equivalence of TAG and CCG revisited. In
Proceedings of The Ninth International Workshop on Tree Adjoining Grammars and Related Formalisms
(TAG+9), Tübingen, Germany, June 2008.

Beryl Hoffman. The Computational Analysis of the Syntax and Interpretation of “Free” Word Order in Turkish.
PhD thesis, University of Pennsylvania, 1995.

Howard Johnson and Joel Martin. Unsupervised Learning of Morphology for English and Inuktitut. In Proceedings
of Human Language Technology and North American Chapter of the Association for Computational
Linguistics Conference (HLT-NAACL’03), Edmonton, Canada, May 2003.

Alistair Knott, J. Moorfield, T. Meaney, and L. Ng. A human-computer dialogue system for Māori language
learning. In Proceedings of the World Conference on Educational Multimedia, Hypermedia and Telecommunications
(ED-MEDIA), June 2003.

Juntae Yoon, Chung-hye Han, Nari Kim, and Mee-sook Kim. Customizing the XTAG system for efficient
grammar development for Korean. In Proceedings of the Fifth International Workshop on Tree Adjoining
Grammars and Related Formalisms (TAG+5), 2000.