Open Source proof-reading tool

Note: this is an archived page

We'd like to change the way LT handles part-of-speech tags (POS tags). Why? As a reminder, this is what the POS tags look like for English:

DT Determiner
NN Noun, singular or mass
NN:U Mass noun

Thus you need regular expressions to express a pattern like "a determiner or a noun". It looks like this: DT|NN.* This is a problem for several reasons:

rule contributors need to know the tag names

rule contributors need to know regular expressions

we're using regular expressions where they're not really needed

The idea is to use more verbose names, maybe like this:

instead of DT, use: pos=determiner

instead of NNS use: pos=noun, number=plural

This way we don't need regular expressions (except a way to express 'or'), and these tag names could be used in XML as well as in the new online rule editor.
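To make the difference concrete, here is a minimal sketch in Python (not LT's actual code or rule syntax). It compares the old regular-expression pattern with an attribute-based pattern for "a determiner or a noun". The `matches` helper, the attribute dictionaries, and the use of a set to express 'or' are all assumptions made for illustration.

```python
import re

# Hypothetical tokens: (word, old tag, verbose attributes). The attribute
# names and values are illustrative, not a fixed LT schema.
EXAMPLE_TOKENS = [
    ("the", "DT", {"pos": "determiner"}),
    ("dogs", "NNS", {"pos": "noun", "number": "plural"}),
    ("run", "VB", {"pos": "verb"}),
]

# Old style: the rule author must know the tag names and regex syntax.
OLD_PATTERN = re.compile(r"DT|NN.*")

# New style: a set of values expresses 'or', no regex involved.
NEW_PATTERN = {"pos": {"determiner", "noun"}}


def matches(features, pattern):
    """Return True if every attribute in `pattern` is satisfied by `features`.

    A pattern value is either a single string or a set of alternatives.
    """
    for attr, wanted in pattern.items():
        allowed = wanted if isinstance(wanted, (set, frozenset)) else {wanted}
        if features.get(attr) not in allowed:
            return False
    return True


def compare():
    """Evaluate both pattern styles on every example token."""
    return [
        (word, bool(OLD_PATTERN.fullmatch(tag)), matches(features, NEW_PATTERN))
        for word, tag, features in EXAMPLE_TOKENS
    ]


if __name__ == "__main__":
    for word, old_result, new_result in compare():
        print(word, old_result, new_result)
```

In this sketch both styles accept the same tokens, but the attribute-based pattern can be written without knowing that plural nouns are tagged NNS.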

Technical Implementation

We could either switch to the new POS tags completely, i.e. modify the dictionaries to contain them, or we could introduce a mapping/interpretation so that the dictionary information gets translated to the new tags after lookup. The latter seems more promising because:

no need to touch the binary dictionaries

the binary dictionaries keep their compact tag representation instead of a verbose one, which helps keep them small (not sure how much of a difference this makes)

we can migrate slowly, i.e. the old way of addressing tags keeps working (probably forever)

The drawback of a mapping/interpretation is that it requires some processing for each lookup, e.g. a lookup in a hash map. It only needs to be done once per token though, so this shouldn't be a problem.
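The mapping approach could be sketched roughly as follows, in Python for brevity. This is an illustration under stated assumptions, not LT's implementation: `TAG_MAP`, the `Token` class, and the attribute names (including `countability`) are hypothetical, and the dictionary lookup is simulated by passing the old tag in directly.

```python
# Hypothetical mapping from old dictionary tags to verbose attributes.
# The attribute choices for NN and NN:U are assumptions for illustration.
TAG_MAP = {
    "DT": {"pos": "determiner"},
    "NN": {"pos": "noun", "number": "singular"},
    "NN:U": {"pos": "noun", "countability": "mass"},
    "NNS": {"pos": "noun", "number": "plural"},
}


class Token:
    """A token that keeps its old tag and derives verbose attributes lazily."""

    def __init__(self, text, old_tag):
        self.text = text
        self.old_tag = old_tag   # old-style addressing keeps working
        self._features = None    # filled in on first access only

    @property
    def features(self):
        # One hash-map lookup per token; later accesses reuse the result.
        if self._features is None:
            self._features = dict(TAG_MAP.get(self.old_tag, {}))
        return self._features


if __name__ == "__main__":
    token = Token("dogs", "NNS")
    print(token.old_tag, token.features)
```

Because the translated attributes are stored on the token, rules that inspect the same token repeatedly do not pay for the lookup again, and tags missing from the map simply yield no attributes while remaining reachable through the old tag.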

Open questions

What exactly should the new POS tags look like?

Answer: it depends on the language, but attributes that will be shared by many languages are: pos, person, case, number, gender, tense, etc.