Abstract

Recently, we have observed a rapid progress in the state of Part of Speech tagging for Polish. Thanks to PolEval—a shared task organized in late 2017—many new approaches to this problem have been proposed. New deep learning paradigms have helped to narrow the gap between the accuracy of POS tagging methods for Polish and for English. Still, the number of errors made by the taggers on large corpora is very high, as even the currently best performing tagger reaches an accuracy of ca. 94.5%, which translates to millions of errors in a billion-word corpus.

To further improve the accuracy of Polish POS tagging we propose to employ a meta-learning approach on top of several existing taggers. This meta-learning approach is inspired by the fact that the taggers, while often similar in terms of accuracy, make different errors, which leads to a conclusion that some of the methods are better in specific contexts than the others. We thus train a machine learning method that captures the relationship between a particular tagger accuracy and language context and in this way create a model, which makes a selection between several taggers in each context to maximize the expected tagging accuracy.