Following a successful application to round four of the T-AP Digging Into Data Challenge, an international team of Assyriologists, computational linguists, and computer scientists will collaborate to tackle difficult questions of cuneiform language processing and analysis using machine learning technologies.

Ancient Mesopotamia is recognized today as the home of the first empires and the birthplace of writing. Mesopotamian history and culture are best known through literary texts and royal inscriptions, but legal and administrative texts, which make up well over 90% of all cuneiform documents, have received much less attention: even when transliterated and digitized, most of them remain untranslated and therefore inaccessible to scholars in related fields and to an interested public. And yet cuneiform receipts and accounts are unique and insightful historical witnesses, insofar as they document the day-to-day management of early state economies. Because these documents are so numerous, a full translation of the corpus by experts appears to be beyond reach. Transliterations are routinely produced, but they remain difficult to interpret for anyone other than specialists in that particular subset of documents. Our project aims to contribute new tools and methods that will help to overcome these difficulties by addressing the major roadblocks in natural language processing of cuneiform languages; these can then be applied to other ancient languages. The resulting tools, together with the translations and the historical, social and economic data extracted from them, will be made openly accessible in machine-readable format.

The methods currently available for Natural Language Processing (NLP) of languages written in cuneiform are solely rule- and dictionary-based. Because of the complex morphology of cuneiform languages, these processes depend on human intervention to obtain reliable results. When dealing with large corpora, unless the sources are very homogeneous, the time required to verify each line of text renders this type of approach impracticable. Moreover, the current methods either are not context-aware or rely on context that is not sufficiently specialized to compensate for the resulting inaccuracy. Machine learning can help to detect subtle patterns and so perform the contextual disambiguation that is a crucial element in machine recognition of the information present in the texts. Such methods are already part of the toolbox available for research in modern languages, and computational linguists are working to adapt these tools to the processing of extinct languages. Our project will see the first application of these methods to languages written in cuneiform.

The MTAAC project’s goal is to address this gap in the natural language processing of cuneiform languages. More specifically, the project’s objectives are:

to formulate, test and evaluate methodologies for the automated analysis and machine translation (MT) of transliterated cuneiform documents, and to make the technology thus developed available to specialists in the field;

to make available the translation of a specific and representative set of cuneiform documents to scholars in related disciplines and to a networked public (see below);

to provide new data for the study of the language, culture, history, economy and politics of the ancient Near East by harvesting the linguistic by-products of the translation and information extraction processes;

to formalize these new data utilizing Linked Open Data (LOD) vocabularies, and to foster the practices of standardization, open data and LOD as integral to projects in digital humanities and computational philology.

As a representative and robust test set of cuneiform documents, for the initial phase of MTAAC we have chosen the corpus of Ur III (21st century BC) administrative texts. These documents are the best candidates for machine learning experiments because of their simple syntax, their homogeneity, and their sheer number: nearly 68,000 texts with 1.5 million lines in Canonical ATF (oracc.museum.upenn.edu/doc/help/editinginatf/), 20,000 of them in translation. They are maintained by the Cuneiform Digital Library Initiative (CDLI), a project that has substantial expertise in the interpretation of this and related cuneiform corpora.
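To give a rough idea of what such a corpus offers to machine translation experiments, the sketch below pairs transliterated lines with their English translations where a translation is present. It is a minimal illustration and not the project's own tooling: the sample text is invented for the purpose (the identifier P000000 is a placeholder, not a catalogue entry), the function name parallel_lines is hypothetical, and the code assumes the common ATF layout in which a numbered text line may be followed directly by a "#tr.en:" translation line.

```python
import re

# Invented sample in the style of an ATF transliteration.
# The identifier and contents are illustrative assumptions, not a real text.
SAMPLE_ATF = """\
&P000000 = Hypothetical text
#atf: lang sux
@tablet
@obverse
1. 1(disz) udu
#tr.en: 1 sheep
2. ki ab-ba-sa6-ga-ta
#tr.en: from Abbasaga
@reverse
1. iti sze-sag11-ku5
"""

# A text line starts with a line number and a period, e.g. "1. " or "3'. ".
TEXT_LINE = re.compile(r"^(\d+'?)\.\s+(.*)$")
# An English translation line starts with "#tr.en:".
TRANSLATION = re.compile(r"^#tr\.en:\s*(.*)$")


def parallel_lines(atf):
    """Return (transliteration, translation or None) pairs in document order."""
    pairs = []
    awaiting = False  # True only right after a text line
    for raw in atf.splitlines():
        line = raw.strip()
        text = TEXT_LINE.match(line)
        trans = TRANSLATION.match(line)
        if text:
            pairs.append((text.group(2), None))
            awaiting = True
        elif trans and awaiting:
            # Attach the translation to the text line immediately before it.
            pairs[-1] = (pairs[-1][0], trans.group(1))
            awaiting = False
        else:
            # Structural or metadata line: do not carry a pairing across it.
            awaiting = False
    return pairs


if __name__ == "__main__":
    for source, target in parallel_lines(SAMPLE_ATF):
        print(source, "=>", target if target is not None else "(untranslated)")
```

Lines that come back with a translation could serve as training pairs, while the untranslated remainder is the material a trained system would be asked to translate.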