Project ideas

If you are new to the CLTK and NLP and would like to contribute code, we ask that you first consider working on one of our Beginners' exercises to get a feel for what NLP is and how we apply it to ancient languages.

GSoC projects

Extend CLTK core to new language. Implement NLP functionality (tokenizer, POS taggers, stopwords, stemmer/lemmatizer, etc.) for a major Classical language. We consider Sanskrit, Classical Chinese, Old/Middle English, and Biblical Hebrew to be the most notable (though not the only) candidates, due to their large surviving texts and currently available digital resources. However, we are open to any language on our List of Classical Languages. This task is more about breadth (adding as much functionality as possible) than depth (e.g., a novel solution to one particular problem). Your GSoC application will be judged on how you demonstrate the following:

what NLP functionality you will be able to add;

the relative priorities of the new functionality;

what kinds of software and algorithms you will be using;

the linguistic or scholarly pedigree of your chosen software and algorithms;

what open source linguistic data (corpora, word lists, dictionaries, etc.) you will use to accomplish your goals;

for the data, how much preparation will be required. Concerning this last point, please remember that GSoC is about code, not data cleaning. We are not able to accept otherwise brilliant applications that also require 6 weeks of data annotation or cleanup. If you believe you are able to do your data prep during the application period or the Community Bonding period, please explain that.

Write extension to Draft.js. For an example of what we are trying to create, see the annotation "layers" on the right-hand side of a page from the Eighteenth-Century Poetry Archive. This year's project builds on two previous summers' work: 2016 ("Develop JavaScript frontend and Python backend to CLTK webapp") and 2017 ("New frontend functionality"); see below for both. The first summer built a reusable framework for Classical texts (code here) and the second created a reusable annotation library for linguistic analysis (see main annotations repo here plus text server and API). We are looking to consolidate this past work into a reusable package which allows users to arbitrarily annotate "layers" (e.g., semantic, syntactic, morphological) with arbitrary metadata. For the CLTK, the Draft.js library allows one to visualize complex text and metadata relationships, which can be leveraged for visualizing NLP annotations from the CLTK core functionality. However, looking beyond the needs of the CLTK, we expect that by the end of the summer your code will be published on npm as a fully abstracted library, ready for use by anyone using any kind of text and in any language.

Miscellaneous projects

Make good installation and use docs for SyntaxNet. SyntaxNet is not a CLTK project; however, some of the pre-built models are for ancient languages. Write very easy-to-follow installation docs for Mac and the latest Ubuntu. Also give examples of how to use its API that a real researcher might be interested in.

Add corpora for a Classical language. The CLTK relies on language specialists to identify and collect the best available data sets for a given language. While these corpora are usually text, they are also often tagged data (e.g., part of speech and syntax), lexica, word lists, etc. See the CLTK wiki page on the relatively easy process of adding corpora to the organization.

Develop machine translation interface to Moses. Because enormous amounts of Classical literature have never been translated into a modern language, building translation models is of great importance and pertinence to the CLTK. Moses is leading software for statistical machine translation, which takes parallel texts and makes decisions about how words and phrases ought to be translated. For Moses, a parallel text consists of two files, one original text and the other its translation, each with matching sentences on the same line number. (See here for a basic, step-by-step walkthrough of creating and using a minimal translation model.) The first step of this project would be assembling a parallel corpus of the original language and a modern-day translation. (There are many sources for texts and translations, with Perseus being a good starting place for Greek, Latin, Arabic, and Old Norse. See here on how to add a corpus to the CLTK organization.) Once a corpus has been made, experimentation with Moses parameters can begin. The CLTK core needs a wrapper to Moses's command line interface (perhaps with something as simple as subprocess.run()) which allows for the selection of arguments. Of course, once a good model has been made, it will be added to the models repository for its language (https://github.com/cltk/{language}_models_cltk, e.g., for Latin) and a Python wrapper for its use added to the CLTK. The entire goal of such a project is to dramatically lower the bar for researchers and translators of ancient, heretofore untranslated texts.
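As a rough illustration of the two ideas above (line-aligned parallel files and a thin command-line wrapper), here is a minimal Python sketch. The function names check_parallel and run_cli and the file names are hypothetical and not part of the CLTK or Moses. The wrapper is deliberately generic: it does not assume any particular Moses executable or flag, and the demonstration uses the Python interpreter as a stand-in command.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def check_parallel(source_path, target_path):
    """Return the number of sentence pairs, or raise if the files are misaligned."""
    with open(source_path, encoding="utf-8") as handle:
        source_lines = handle.read().splitlines()
    with open(target_path, encoding="utf-8") as handle:
        target_lines = handle.read().splitlines()
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line counts differ: {len(source_lines)} vs {len(target_lines)}"
        )
    for number, (src, tgt) in enumerate(zip(source_lines, target_lines), start=1):
        if not src.strip() or not tgt.strip():
            raise ValueError(f"empty sentence on line {number}")
    return len(source_lines)


def run_cli(executable, args, input_text=None):
    """Run a command-line tool with the chosen arguments and return its stdout."""
    completed = subprocess.run(
        [str(executable), *map(str, args)],
        input=input_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


# Toy parallel corpus: sentence N of one file translates sentence N of the other.
workdir = Path(tempfile.mkdtemp())
latin = workdir / "corpus.la"      # hypothetical file name
english = workdir / "corpus.en"    # hypothetical file name
latin.write_text("arma virumque cano\nveni vidi vici\n", encoding="utf-8")
english.write_text(
    "I sing of arms and the man\nI came, I saw, I conquered\n", encoding="utf-8"
)
pair_count = check_parallel(latin, english)

# Stand-in for a real Moses call: a Python one-liner that echoes stdin.
echoed = run_cli(
    sys.executable,
    ["-c", "import sys; sys.stdout.write(sys.stdin.read())"],
    input_text="veni vidi vici\n",
)
```

In a real wrapper, executable would point at the locally installed Moses binary and args would carry whatever options the user selects. Checking alignment before training is cheap and catches the most common corpus-assembly mistake.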

Convert Coptic Scriptorium to Python. The Coptic Scriptorium is a website for doing NLP with Coptic. The Perl code has been rewritten into Python for its web API. Someone may bring this code into the CLTK. Most of its code is written in Perl, however, so this task entails rewriting its algorithms in Python according to the CLTK's API. Repos for rewriting include the POS tagger and lemmatizer, the orthography normalizer, and the tokenizer. You'll likely need to contact the author of this code for help, and you will need at least a beginner's understanding of Coptic.

Completed

New frontend functionality. The CLTK's in-development website, the Classical Language Archive, and API are an open platform onto which we can add new languages, texts, and NLP functionality. See here for its latest iteration. The website aims to be a beautiful reading environment that is also useful for both teaching and scholarship of classical languages. The API aims to bring all of the CLTK's core functionality to the web and act as a text server for CLTK corpora. We are currently partnering with classical language educators and publications experts to assist in determining the best platform we can create for classrooms, researchers, and general public users. We welcome projects which offer creative ideas of how to bring the platform to life, including but not limited to the following list:

Converting all corpora managed by the CLTK to our standard JSON format

Developing a system to allow a teacher to create lists of annotations to share with a class studying a work in the CLTK Archive

Integrating current texts in the CLTK Archive with 3rd party resources such as Google Scholar, JSTOR, VIAF, or DBpedia

Joining a translation of a work with the original text semantically: we currently have the ability to show a translation of a given work in the CLTK Archive next to the work in the original language, but for the next round of updates, we need to approximate as much as possible which paragraphs/lines from the translation match with which paragraphs/lines from the original.

Incorporating and improving other linked data related to the texts managed by the CLTK Archive, including hyperlinking cross-references to other passages, discovery of related passages using the CLTK text reuse module, lookups for dictionary definitions in languages other than Latin, information retrieval, knowledge graphs, part-of-speech tagging and syntax parsing, and map and timeline integration

Get Latin macronizer to compile, add to prosody module. This work could be close to done; it just needs someone versed in C to get the legacy C application to build with Clang (Mac) and GCC (Linux). See this ticket for details about making it work. The macronizer is not a goal in and of itself; the goal is to put it in a pipeline which macronizes incoming strings and feeds its output into the prosody scanner (docs here). As with the "Improve part-of-speech tagger" task (below), this code will need to be automatically importable, buildable, and runnable all from within the CLTK, as has been done with the TLGU. Other POS-based approaches are welcome for this, too. It's conceivable that the original macronizer should be rewritten entirely in Python, or that the macronizing logic could be built to leverage another POS tagger (such as the CLTK's).

Clean and transform documents from corpora to render to the frontend reading environment. For the CLTK API, one of the biggest challenges currently is converting the XML format preferred by some digital libraries to JSON that can be consumed by the reading environment frontend. The overall goal is to import data from an XML file that looks like this and render it in this layout. The JSON files that have been transformed are available in the latin_text_perseus repo. See the issues on the cltk_api repo related to document conversion for information on the status of conversion.
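To make the XML-to-JSON step concrete, the following is a minimal sketch using only the Python standard library. The XML structure (text, book, and l elements with n attributes) and the JSON keys are invented for illustration; they are assumptions and do not reproduce the actual Perseus XML or the CLTK's standard JSON format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML; real digital-library files are far richer.
SAMPLE_XML = """
<text title="Aeneid" author="Vergil">
  <book n="1">
    <l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
    <l n="2">Italiam, fato profugus, Laviniaque venit</l>
  </book>
</text>
"""


def xml_to_document(xml_string):
    """Convert the toy XML above into a nested dict ready for json.dumps."""
    root = ET.fromstring(xml_string.strip())
    document = {
        "title": root.get("title"),
        "author": root.get("author"),
        "books": [],
    }
    for book in root.findall("book"):
        lines = [
            {"n": int(line.get("n")), "text": (line.text or "").strip()}
            for line in book.findall("l")
        ]
        document["books"].append({"n": int(book.get("n")), "lines": lines})
    return document


document = xml_to_document(SAMPLE_XML)
# ensure_ascii=False keeps Greek and other non-ASCII text readable in the output.
as_json = json.dumps(document, ensure_ascii=False, indent=2)
```

Keeping the conversion as a pure function from an XML string to a dict makes it easy to test each corpus format in isolation before any files are written.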

Improve lemmatizer. Lemmatization is essential to NLP in highly inflected languages, since word statistics often need to address unique dictionary headwords, not their many permutations. However, simple rule-based stemming or lemmatizing tends to fail on the many permutations of word endings. The current CLTK lemmatizer, used for both Greek and Latin, needs to be improved for greater accuracy, especially in the areas of (a) recognizing rare forms and (b) resolving ambiguous forms. It works with a dictionary that maps possible tokens to lemmata. This approach fails in case (a), because unseen forms will always fail to match the preset dictionary. It also does not help in case (b), as ambiguous forms remain unresolved by simple mapping. Other available Latin lemmatizers, including those that can be used with a Python wrapper, such as Morpheus (via the Archimedes XML-RPC service), do not address these two issues, either. We recommend a multi-pronged approach which, upon failure to match, leverages one or more of the following: regex-based rules based on inflection, bigram/trigram statistics, and POS tagging of the given sentence, all of which may be used to hint at a statistically probable match.
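The dictionary-then-fallback idea can be sketched as follows. This is a toy illustration, not the CLTK lemmatizer: the dictionary entries, the two ending rules, and the function name lemmatize are all hypothetical, and only the regex-based fallback is shown (no n-gram statistics or POS tagging).

```python
import re

# Toy dictionary mapping inflected tokens to lemmata (illustrative, not CLTK data).
LEMMA_DICT = {
    "amat": "amo",
    "puellam": "puella",
}

# Hypothetical fallback rules, tried in order when the dictionary lookup fails.
ENDING_RULES = [
    (re.compile(r"^(\w+)arum$"), r"\1a"),  # first-declension genitive plural
    (re.compile(r"^(\w+)am$"), r"\1a"),    # first-declension accusative singular
]


def lemmatize(token):
    """Return (lemma, method), where method records how the lemma was found."""
    token = token.lower()
    if token in LEMMA_DICT:
        return LEMMA_DICT[token], "dictionary"
    for pattern, replacement in ENDING_RULES:
        if pattern.match(token):
            return pattern.sub(replacement, token), "rule"
    # Nothing matched: return the token unchanged and flag it.
    return token, "unresolved"


results = [lemmatize(word) for word in ["puellam", "Rosarum", "et"]]
```

Rules like these overgenerate (a verb form ending in -am would be treated as a noun), which is why they are better used to propose candidates that statistics or POS tags then rank, rather than to decide on their own. Returning the method alongside the lemma makes it easy to measure how often each fallback fires.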

Make dependency grammar taggers for Greek and Latin. Dependency grammar is a method of parsing the syntax of sentences. It is especially well suited to languages with non-fixed word order. This project will allow for valuable high-level insights into author stylistics and the historical development of syntax. The three core steps involved are, basically: parse treebank data sets, run a dependency grammar algorithm, and add the produced model to a language's model repository (https://github.com/cltk/{language}_models_cltk). The Stanford Parser and Malt Parser are leading software packages for training and using models. Python bindings exist and should be leveraged if they prove to work as intended; otherwise, a call to subprocess.run() can be made to work. Once the model has been trained, an intuitive interface will need to be coded, along the lines of cltk/tag/pos.py. Annotated Greek and Latin treebanks were originally tagged by the Alpheios Project, and many other texts are newly available from Perseus Treebank Data. Several other significant annotated corpora exist (Index Thomisticus and PROIEL); however, they may use different annotations, and their tag conventions should be converted to Universal Dependency labels, if possible. Some useful dependency grammar resources are here: FAQ, how to parse, an early-stage neural network parser (also from Stanford), and Universal Dependency grammar documentation (look for Ancient Greek, Ancient Greek-PROIEL, Latin-ITT, and Latin-PROIEL). Finally, this project can be done for any Classical language with dependency grammar treebanks.
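The first step, parsing treebank data, might look like the sketch below. It assumes the treebank has already been converted to CoNLL-U, the ten-column tab-separated format used by the Universal Dependencies project; the source treebanks named above may ship in other formats. The sample sentence, its annotation, and the function name read_arcs are made up for illustration.

```python
# One toy sentence in CoNLL-U: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD,
# DEPREL, DEPS, MISC. Unused columns are filled with "_".
SAMPLE_CONLLU = "\n".join([
    "# text = puella rosam amat",
    "1\tpuella\tpuella\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "2\trosam\trosa\tNOUN\t_\t_\t3\tobj\t_\t_",
    "3\tamat\tamo\tVERB\t_\t_\t0\troot\t_\t_",
])


def read_arcs(conllu_sentence):
    """Return (head_form, relation, dependent_form) triples for one sentence."""
    tokens = {}
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split("\t")
        token_id = columns[0]
        if not token_id.isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens[int(token_id)] = (columns[1], int(columns[6]), columns[7])
    arcs = []
    for token_id in sorted(tokens):
        form, head, relation = tokens[token_id]
        # HEAD 0 marks the root of the sentence.
        head_form = "ROOT" if head == 0 else tokens[head][0]
        arcs.append((head_form, relation, form))
    return arcs


arcs = read_arcs(SAMPLE_CONLLU)
```

Because each word points at its head by index rather than by position, the same representation works regardless of word order, which is what makes dependency grammar a good fit for Greek and Latin.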

Improve part-of-speech tagger. The current CLTK POS taggers (docs here for Greek, Latin) are passably accurate for some needs (80% Greek, 70% Latin), especially for seeding an annotated set (see Xenophon example here). However, the Lapos tagger, a new program/algorithm, claims to have much higher accuracy for Latin, at about 85% to 95%. Lapos is C++ code, which means whoever takes this on will need to design a hands-off workflow for installation and use (similar to what was done with the C TLGU program). For the data set, the Lapos tagger should be trained on the Alpheios treebanks (Greek, Latin); however, Perseus's latest treebank collection (CLTK fork here) needs to be consulted, too. Finally, consideration should be given to whether the PROIEL treebanks' POS tags (CLTK fork) can be rewritten to Perseus's style (and vice versa, of course). All the models will be stored with the CLTK model repos.
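Accuracy figures like those above are typically computed as the share of tokens whose predicted tag matches a gold-standard tag. A minimal sketch of that calculation follows; the function name tagging_accuracy, the tag labels, and the toy data are hypothetical and do not come from the CLTK or Lapos.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences must be the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(
        1
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted)
        if gold_tag == predicted_tag
    )
    return correct / len(gold)


# Toy (token, tag) pairs; one of the five predictions is wrong.
gold = [("puella", "N"), ("rosam", "N"), ("amat", "V"), ("et", "C"), ("cantat", "V")]
predicted = [("puella", "N"), ("rosam", "A"), ("amat", "V"), ("et", "C"), ("cantat", "V")]
accuracy = tagging_accuracy(gold, predicted)
```

Scoring every candidate tagger on the same held-out portion of a treebank with a function like this keeps comparisons between the existing taggers and a new one fair.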

Develop JavaScript frontend and Python backend to CLTK webapp. A thoughtfully designed web-based application is the ideal medium for reading and studying Classical texts, so we are currently designing and developing an application to provide the tools necessary for both casual reading and serious study of Classical languages. Opportunities are available to have significant impact on this forthcoming CLTK website. The frontend under development is a reading interface which uses the Meteor framework with modular frontend components in React. Tasks that need doing include both rewriting elements of a prototype application (segetes.io, in Angular) and developing innovative new modules. A RESTful backend API is written in Python (with the Flask framework) and is currently under early development. Its goals are to serve text and metadata and to act as an interface to all the text processing capabilities of the CLTK (see here for an example of text serving). Project proposals may target either frontend or backend, or both.