Corpus based part-of-speech tagging

Abstract

In natural language processing, a part-of-speech (POS) tagger is a crucial subsystem in a wide range of applications: it labels (or classifies) unannotated words of natural language with POS labels corresponding to categories such as noun, verb or adjective. Mainstream approaches are generally corpus-based: a POS tagger learns from a corpus of pre-annotated data how to correctly tag unlabeled data. This article presents a brief account of the state of the art in POS tagging. POS tagging approaches use a labeled corpus to train computational models. Several typical models from three kinds of tagging are introduced: rule-based tagging, statistical approaches and evolutionary algorithms. The advantages and pitfalls of each typical approach are discussed and analyzed. Some rule-based and stochastic methods have achieved accuracies of 93–96 %, while some evolutionary algorithms reach about 96–97 %.
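As a rough illustration of what "learning from a pre-annotated corpus" can mean in its simplest form, the sketch below trains a most-frequent-tag baseline: each word receives the tag it carried most often in the training data, and unseen words fall back to the overall most frequent tag. This is a common baseline, not one of the specific models surveyed in the article. The toy corpus, the tag names (DET, NOUN, VERB) and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Learn, for every word, the tag it most often carries in the corpus."""
    tag_counts_per_word = defaultdict(Counter)
    overall_tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts_per_word[word.lower()][tag] += 1
            overall_tag_counts[tag] += 1
    # Keep only the single most frequent tag for each known word.
    lexicon = {
        word: counts.most_common(1)[0][0]
        for word, counts in tag_counts_per_word.items()
    }
    # Unknown words fall back to the most frequent tag overall.
    default_tag = overall_tag_counts.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Tag unlabeled words using the learned lexicon."""
    return [(word, lexicon.get(word.lower(), default_tag)) for word in words]


# Hypothetical toy corpus; a real tagger would be trained on a large annotated corpus.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]

lexicon, default_tag = train_most_frequent_tag(TOY_CORPUS)
result = tag_words(["The", "cat", "barks", "loudly"], lexicon, default_tag)
print(result)
```

In this toy run the unseen word "loudly" is tagged NOUN by the fallback, which is wrong for an adverb. Errors of this kind, on unknown and ambiguous words, are what the rule-based, statistical and evolutionary approaches try to reduce by taking context into account.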
