Jitar

A simple Trigram HMM part-of-speech tagger

Introduction

Jitar is a simple part-of-speech tagger, based on a trigram Hidden
Markov Model (HMM). It (partly) implements the ideas set forth in
[1]. Jitar is written in Java, so it should be easy to use in other
Java programs, or languages that run on the JVM.

Warning

The Jitar API will be highly unstable for the first few versions!

Download

The latest Jitar version can be downloaded from the releases page.
The binary distribution includes a couple of handy scripts for
running Jitar.

If you would like to use Jitar in your own software, add it as
a dependency.

Training

A model can be created from a corpus that includes part-of-speech
tags, such as the Brown corpus. Use the training program to create
the model:

bin/train brown my_brown_corpus my_corpus.model

Replace brown with conll if you are using a corpus in CoNLL format.
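
For orientation, a Brown-style corpus line consists of whitespace-separated
word/tag tokens. The sketch below shows one way such a line could be split
into word and tag pairs. BrownLineSketch and its parse method are
hypothetical names used for illustration only; they are not part of the
Jitar API, and Jitar's own corpus reader may handle details differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Jitar API. */
final class BrownLineSketch {
    /** Splits one Brown-style line ("word/tag word/tag ...") into {word, tag} pairs. */
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank line
            }
            // Split on the last slash, so that words containing a slash stay intact.
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                throw new IllegalArgumentException("Not a word/tag token: " + token);
            }
            pairs.add(new String[] { token.substring(0, slash), token.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("The/at cat/nn is/bez on/in the/at mat/nn ./.")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```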

Tagging

Usually, you will want to call the tagger from your own program, but
we have included a simple command-line tagger as a sample. This
tagger reads pretokenized sentences from standard input (one
sentence per line) and prints the best-scoring tag sequence for each
sentence to standard output. For example:

$ echo "The cat is on the mat ." | bin/tag model
AT NN BEZ IN AT NN .
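
Since the sample tagger prints only the tags, a caller usually has to line
them up with the input tokens again. The following is a minimal sketch of
that step, assuming one tag per whitespace-separated token as in the example
above. TagOutputSketch and pair are hypothetical names, not part of Jitar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Jitar API. */
final class TagOutputSketch {
    /** Pairs each token of a pretokenized sentence with the tag at the same position. */
    static List<String> pair(String sentence, String tagLine) {
        String[] tokens = sentence.trim().split("\\s+");
        String[] tags = tagLine.trim().split("\\s+");
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException(
                "Got " + tokens.length + " tokens but " + tags.length + " tags");
        }
        List<String> tagged = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            tagged.add(tokens[i] + "/" + tags[i]);
        }
        return tagged;
    }

    public static void main(String[] args) {
        // Input sentence and tagger output taken from the example above.
        System.out.println(String.join(" ",
            pair("The cat is on the mat .", "AT NN BEZ IN AT NN .")));
    }
}
```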

Release plan

Versions 0.y.z may break the API. The plan is to offer API stability
within a given major version x (in x.y.z) once x >= 1.

0.4.0 (Planned)

Use Dictomaton to store the lexicon and suffixes for unknown words.

Compute interpolated scores only once.
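
To illustrate the second item: a trigram tagger in the style of TnT scores a
tag by linearly interpolating unigram, bigram, and trigram estimates. The
sketch below caches each interpolated value so that it is computed only once
per tag trigram. It is a sketch of the general idea only, not Jitar's
implementation; the class name, the map-based tables, and the lambda weights
are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of cached linear interpolation; not Jitar's implementation. */
final class InterpolationCacheSketch {
    private final double l1, l2, l3;            // interpolation weights (assumed given)
    private final Map<String, Double> unigram;  // P(t3), keyed by "t3"
    private final Map<String, Double> bigram;   // P(t3 | t2), keyed by "t2 t3"
    private final Map<String, Double> trigram;  // P(t3 | t1, t2), keyed by "t1 t2 t3"
    private final Map<String, Double> cache = new HashMap<>();
    int computations = 0;                       // counts cache misses, for demonstration

    InterpolationCacheSketch(double l1, double l2, double l3,
                             Map<String, Double> unigram,
                             Map<String, Double> bigram,
                             Map<String, Double> trigram) {
        this.l1 = l1;
        this.l2 = l2;
        this.l3 = l3;
        this.unigram = unigram;
        this.bigram = bigram;
        this.trigram = trigram;
    }

    /** Returns l1*P(t3) + l2*P(t3|t2) + l3*P(t3|t1,t2), computing it at most once per trigram. */
    double probability(String t1, String t2, String t3) {
        String key = t1 + " " + t2 + " " + t3;
        return cache.computeIfAbsent(key, k -> {
            computations++;
            return l1 * unigram.getOrDefault(t3, 0.0)
                 + l2 * bigram.getOrDefault(t2 + " " + t3, 0.0)
                 + l3 * trigram.getOrDefault(k, 0.0);
        });
    }
}
```

A precomputed table filled at training time would serve the same purpose; the
lazy cache is used here only because it keeps the example short.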

0.3.0

Add a capitalization marking to tags (as per the TnT paper). This gives
an improvement of around 0.2% on German and English.

Add a separate unknown word distribution for words containing a dash.
This provides a modest improvement for English and German.

API simplification (no more need to use/specify start and end markers).