X Summer School

Creating a Ukrainian-language NER system

Abstract:

Named-entity recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a subtask of information extraction that seeks to locate and classify elements of text into predefined categories such as the names of persons, organizations, and locations, time expressions, quantities, monetary values, and percentages.
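To make "locate and classify" concrete, here is a minimal sketch of one common way NER output is represented: a tag per token in the BIO scheme (B- begins an entity, I- continues it, O marks a token outside any entity). The function name `bio_to_spans`, the label names PER/LOC/DATE, and the sample sentence are illustrative assumptions, not part of the project.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from BIO-tagged tokens."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    def flush() -> None:
        # Store the entity collected so far, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag extends the open entity of the same label.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity,
            # is treated as outside any entity in this sketch.
            flush()
            current_tokens = []
            current_label = None
    flush()
    return spans


# Hypothetical example: "Taras Shevchenko was born in Moryntsi in 1814."
tokens = ["Тарас", "Шевченко", "народився", "в", "Моринцях", "у", "1814", "році"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "B-DATE", "O"]
entities = bio_to_spans(tokens, tags)
```

Here `entities` holds the located spans together with their categories, which is the kind of result a NER system is expected to produce from raw text.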

NER is a popular NLP task, and the main challenge in building a robust NER system is access to a sufficiently large corpus of annotated data. Such data is not available for every language, and Ukrainian in particular lacks it, but unsupervised and semi-supervised approaches offer a possible way forward.

We will use the unannotated Ukrainian language corpus (https://github.com/mariana-scorp/lt-project) as a starting point. We will need to develop some datasets and annotations of our own, and will try either to adapt one of the existing NER algorithms or to come up with our own variation.
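One possible way to bootstrap annotations from unannotated text is to match a small seed list of known names (a gazetteer) against the corpus and emit BIO-style tags (B- begins an entity, I- continues it, O is outside). The sketch below is only an illustration of that idea, not the method the project commits to; the gazetteer entries, the label names, and the function `weak_label` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical seed gazetteer: lowercased token tuples mapped to a label.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("тарас", "шевченко"): "PER",
    ("київ",): "LOC",
    ("україна",): "LOC",
}


def weak_label(tokens: List[str], gazetteer: Dict[Tuple[str, ...], str]) -> List[str]:
    """Tag tokens by longest exact match against the gazetteer."""
    tags = ["O"] * len(tokens)
    max_len = max((len(key) for key in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                label = gazetteer[key]
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                matched = n
                break
        # Skip past the match, or move one token forward if nothing matched.
        i += matched if matched else 1
    return tags


# "Taras Shevchenko visited Kyiv."
tags_a = weak_label(["Тарас", "Шевченко", "відвідав", "Київ"], GAZETTEER)
# "Kyiv is the capital of Ukraine." The genitive form "України" is missed.
tags_b = weak_label(["Київ", "столиця", "України"], GAZETTEER)
```

The second sentence shows the main limitation of exact matching for Ukrainian: inflected forms such as "України" do not match the dictionary form "україна", so a practical version would likely need lemmatization or a morphological analyzer before lookup.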