Building a (Finnish) Part of Speech Tagger

I wanted to try a part of speech tagger (POS) to see if it could help me with some of the natural language processing (NLP) problems I had. This was in Finnish, although other languages would be nice to have supported for the future. So off I went, (naively) hoping that there would be some nicely documented, black-box, open-source, free, packages available. Preferably, I was looking for one in Java as I wanted to try using it as part of some other Java code. But other (programming) languages might work as well if possible to use as a service or something. Summary: There are a bunch of cool libs out there, just need to learn POS tagging and some more NLP terms to train them first…

I remembered all the stuffs on ParseMcParseFace, Syntaxnet and all those hyped Google things. It even advertises achieving 95% accuracy on Finnish POS tagging . How cool would that be. And its all about deep learning, Tensorflow, Google Engineers and all the other greatest and coolest stuff out there, right? OK, so all I need to do is go to their Github site , run some 10 steps of installing various random sounding packages, mess up my OS configs with various Python versions, settings, and all the other stuff that makes Python so great (OK lets not get upset, its a great programming language for stuffs :)). Then I just need check out the Syntaxnet git repo, run a build script for an hour or so, set up all sorts of weird stuff, and forget about a clean/clear API. OK, I pass, after messing with it too long.

So. After trying that mess, I Googled, Googled, Duckducked, and some more for some alternatives better suited for me. OpenNLP seemed nice as it is an Apache project, which have generally worked fine for me. There are a number of different models for it at SourceForge . Some of them are even POS tagger models. Many nice languages there. But no Finnish. Now, there is an option to train your own model . Which seems to require some oddly formatted, pre-tagged text sets to train. I guess that just means POS tagging is generally seen as a supervised learning problem. Which is fine, it’s just that if you are not deep in the NLP/POS tagging community, these syntaxes do look a bit odd. And I just wanted a working POS tagger, not a problem of trying to figure out what all these weird syntaxes are, or a problem of going to set up a project on Mechanical Turk or whatever to get some tagged sentences in various languages.

What else? There is a nice looking POS tagger from Stanford NLP group. It also comes with out-of-the-box models for a few languages. Again, no Finnish there either but a few European ones. Promising. After downloading it, I managed to get it to POS tag some English sentences and even do lemmatization for me (finding the dictionary base form of the word, if I interpret that term correctly). Cool, certainly useful for any future parsing and and other NLP tasks for English. They also provide some instructions for training it for new languages.

This training again requires the same pre-annotated set of training data with POS tagging. Seeing some pattern here.. See, even I can figure it out sometime. So there is actually a post on the internets, where someone describes building a Swedish POS tagger using the Stanford tagger. And another one instructing people (in comments) to downloaded the tagger code and read it to understand how to configure it. OK, not going to do that. I just wanted a POS tagger, not an excursion into some large code base to figure out some random looking parameters that require a degree in NLP to understand them. But hey, Sweden is right next to Finland, maybe I can try the configuration used for it to train my own Finnish POS tagger? What a leap of logic I have there..

I downloaded the Swedish .props file for the Stanford tagger, and now just needed the data. Which, BTW, I needed also for all the others, so I might as well have gone with the OpenNLP as well and tried that, but who would remember that anymore at this point.. The Swedish tagger post mentioned using some form of Swedish TreeBank data. So is there a similar form of Finnish TreeBank? I remember hearing that term. Sure there is. So downloaded that. Unpack the 600MB zip to get a 3.8GB text file for training. The ftb3.1.conllx file. Too large to open in most text editors. More/less to the rescue.

But hey, this is sort of like big data, which this should be all about, right? Maybe the Swedish .props file just works with it, after all, both are Treebanks (whatever that means)? The Swedish Treebank site mentions having a specific version for the Stanford parser built by some Swedish treebank visitor at Googleplex. Not so for Finnish.

Just try it. Of course the Swedish .props file wont work with the Finnish TreeBank data. So I build a Python script to parse it and format it more like the Swedish version. Words one per line, sentences separated with linefeeds. The tags seem to differ across various files around but I have no idea about how to map them over so I just leave them and hope the Stanford people have it covered. (Looking at it later, I believe they all treat it as a supervised learning problem with whatever target tags you give.)

Tried the transformed file with the Stanford POS tagger. My Python script tells me the file has about 4.4 million sentences, with about 76 million words or something like that. I give the tagger JVM 32GB memory and see if it can handle it. No. Out of memory error. Oh dear. It’s all I had. After a few minor modifications in the .props file, and I make the training data set smaller until finally at 1M sentences the tagger finishes training.

Meaning the program runs through and prints nothing (no errors but nothing else either). There is a model file generated I can use for tagging. But I have no idea if this is any good or not, or how badly did I just train it. Most of the training parameters have a one-line description in the Javadoc, which isn’t hugely helpful (for me). Somehow I am not too confident I managed to do it too well. Later as I did various splits on the FinnTreeBank data for my customized Java tagger and the OpenNLP tagger, I also tried this one with the 1.4M sentence test set. Got about 82% accuracy, which seems pretty poor considering everything else I talk about in the following. So I am guessing my configuration must have been really off since otherwise people have reported very good results with it. Oh well, maybe someone can throw me a better config file?

This is what running the Stanford tagger on the 1M sentence set looked like on my resource graphs:

So it mostly runs on a single core and uses about 20GB of RAM for the 1M sentence file. But obviously I did not get it to give me good results, so what other options do I have?

During my Googling and stuff I also ran into a post describing writing a custom POS tagger in 200 lines of Python. Sounds great, even I should be able to get 200 lines of Python, right? I translated that to Java to try it out on my data. Maybe I will call my port “LittlePOS”. Make of that what you will :). At least now I can finally figure out what the input to it should be and how to provide it, since I wrote (or translated) the code, eh?

Just to quickly recap what (I think) this does.

Normalize all words = lowercase words, change year numbers to “!YEAR” and other numbers to “!DIGIT”.

Collect statistics for each word, how often different POS tags appear for each word. A threshold of 97% is used to mark a word as “unambiguous”, meaning it can always be tagged with a specific tag if it has that tag 97% or more times in the training data. The word also needs to occur some minimum number of times (here it was 20).

Build a set of features for each POS tag. These are used for the “machine learning” part to learn to identify the POS tag for a word. In this case the features used were:

Suffix of word being tagged. So its last 3 letters in this case.

Prefix of word being tagged. Its first letter in this case.

Previous tag. The tag assigned to previous word in sentence.

2nd previous tag. The tag assigned to the previous word to the previous word :).

Combination of the previous and previous-previous tags. So previous tag-pair.

The word being tagged itself.

Previous tag and current-word pair.

Previous word in sentence.

Suffix of previous word, its 3 last letters.

Previous-previous word. So back two spots in the sentence where we are tagging.

Next word in sentence.

Suffix of next word. Its 3 last letters.

Next-next word in sentence. So the next word after the next word. To account for the start and end of a sentence, the sentence word array is always initialized with START1, START2 and END1, END2 “synthetic words”. So these features also work even if there is no real previous or next word in the sentence. Also, word can be anything, including punctuation marks.

Each of the features is given a weight. This is used to calculate prediction of what POS tag a word should get based on its features in the sentence.

If, in training, a word is given (predicted) a wrong tag based on its features, the weights of those features for the wrong tag are reduced by 1 each, and the weights for those features for the correct tag are increase by 1 each.

If the tag was correctly predicted, the weights stay the same.

Getting this basic idea also helps me understand the other parsers and their parameters a bit better. I think this is what is defined by the “arch” parameter in the Stanford tagger props file, and would maybe need a better fix? I believe this setting of parameters must be one of the parts of POS tagging with the most diverse sets of possibilities as well.. Back to the Stanford tagger. It also seemed a bit slow at 50ms average tagging time per sentence, compared to the other ones I discuss in the following. Not sure what I did wrong there. But back to my Python to Java porting.

I updated my Python parser for the FinnTreeBank to produce just a file with the word and POS tag extracted and fed that LittlePOS. This still ran out of memory on the 4.4M sentences with 32GB JVM heap. But not in the training phase, only when I finally tried to save the model as a Protocol Buffers binary file. The model in memory seems to get pretty big, so I guess the protobuf generator also ran out of resources when trying to build 600MB file with all the memory allocated for the tagger training data.

In the resources graph this is what it looks like for the full 4.4M sentences:

The part on the right where the “system load” is higher and the “CPU” part looks to bounce wildly is where the protobuf is being generated. The part on the left before that is the part where the actual POS tagger training takes place. So the protobuf generation actually was running pretty long, my guess is the JVM memory was low and way too much garbage collection etc. is happening. Maybe it would have finished after few more hours but I called it a no-go and stopped it.

3M sentences finishes training fine. I use the remaining 1.4M for testing the accuracy. Meaning I use the trained tagger to predict tags for those 1.4M sentences and count how many words it tagged right in all of those. This gives me about 96.1% accuracy on using the trained tagger. Aawesome, now I have a working tagger??

The resulting model for the 3M sentence training set, when saved as a protobuf binary, is about 600MB. Seems rather large. Probably why it was failing to write it with the full 4.4M sentences. A smaller size model might be useful to make it more usable in a smaller cloud VM or something (I am poor, and cloud is expensive for bigger resources..). So I tried to train it on sentences of size 100k to 1M on 100k increments. And on 1M and 2M sentences. Results for LittlePOS are shown in the table below:

Sentences

Words correct

Accuracy

PB Size

Time/1

100k

21988662

88.7%

90MB

4.5ms

200k

22490881

90.7%

153MB

4.1ms

300k

22608641

91.2%

195MB

3.9ms

400k

22779163

91.9%

233MB

3.8ms

500k

22911452

92.4%

268MB

3.7ms

600k

23033403

92.9%

304MB

3.5ms

700k

23095784

93.1%

337MB

3.7ms

800k

23149286

93.4%

366MB

3.5ms

900k

23169125

93.4%

390MB

3.2ms

1M

23167721

93.4%

378MB

3.3ms

2M

23520297

94.8%

651MB

3.0ms

3M

23843609

96.2%

890MB

2.0ms

1M_2

23105112

93.2%

467MB

ms

3M_0a

20859104

84.1%

651MB

1.7ms

3M_0b

22493702

90.7%

651MB

1.7ms

Here

Sentences is the number of sentences in the dataset.

Correct is the number of words correctly predicted. The total number of words is always 24798043 as all tests were run against the last 1.4M sentences (ones left over after taking the 3M training set).

Accuracy is the percentage of all predictions that it got right.

PB Size is the size of the model as a Protocol Buffers binary after saving to disk.

Time/1 is the time the tagger took on average to tag a sentence.

The line with 1M_2 shows an updated case, where I changed the training algorithm to run for 50 iterations instead of the 10 it had been set to in the Python script. Why 50? Because the Stanford and OpenNLP seem to use a default of 100 iterations and I wanted to see what difference it makes to increase the iteration count. Why not 100? Because I started it with training the 3M model for 100 iterations and looking at it, I calculated it would take a few days to run. The others were much faster so plenty of room for optimization there. I just ran it for 1M sentences and 50 iterations then, as that gives an indication of improvement just as well.

So, the improvement seems pretty much zero. In fact, the accuracy seems to have gone slightly down. Oh well. I am sure I did something wrong again. It is possible also to take the number of correctly predicted tags from the added iterations during training. The figure below illustrates this:

This figure shows how much of the training set the tagger got right during the training iterations. So maybe the improvement in later iterations is not that big due to the scale but it is still improving. Unfortunately, in this case, this did not seem to have a positive impact on the test set. There are also a few other points of interest in the table.

Back to the results table. The line with 3M_0a shows a case where all the features were ignored. That is, only the “unambiguous” ones were tagged there. This already gives the result of 84.1%. The most frequent tag in the remaining untagged ones is “noun”. So tagging all the remaining 15.9% as nouns gives the score in 3M_0b. In other words, if you take all the words that seem to clearly only have one tag given for them, given them that tag, and tag all the remaining ones as nouns, you get about 90.7% accuracy. I guess that would be the reference to compare against.. This score is without any fancy machine learning stuffs. Looking at this, the low score I got for training the Stanford POS tagger was really bad and I really need that for dummies guide to properly configure it.

But wait, now that I have some tagged input data and Python scripts to transform it into different formats, I could maybe just modify these scripts to give me OpenNLP compliant input data? Brilliant, lets try that. At least OpenNLP has default parameters and seems more suited for dummies like me. So on to transform my FinnTreeBank data to OpenNLP input format and run my experiments. Python script. Results below.

Sentences

Words correct

Accuracy

PB Size

Time/1

100k

22247182

89.7%

4.5MB

7.5ms

200k

22680369

91.5%

7.8MB

7.6ms

300k

22861728

92.2%

10.4MB

7.7ms

400k

22994242

92.7%

12.8MB

7.8ms

500k

23114140

93.2%

14.8MB

7.8ms

600k

23199457

93.6%

17.1MB

7.9ms

700k

23235264

93.7%

19.2MB

7.9ms

800k

23298257

94.0%

21.1MB

7.9ms

900k

23324804

94.1%

22.8MB

7.9ms

1M

23398837

94.4%

24.5MB

8.0ms

2M

23764711

95.8%

39.9MB

8.0ms

3M

24337552

98.1%

55.9MB

8.1ms

(4M)

24528432

98.9%

69MB

9.6ms

4M_2

6959169

98.5%

69MB

9.7ms

(4.4M)

24567908

99.1%

73.5MB

9.6ms

There are some special cases here:

(4M): This mixed training and test data in training with the first 4M of the 4.4M sentences, and then taking the last 1.4M of the 4.4M for testing. I believe in machine learning you are not supposed to test with the training data or the results will seem too good and not indicate any real world performance. Had to do it anyway, didn’t I 🙂

(4.4): This one used the full 4.4M sentences to train and then tested on the subset 1.4M of the same set. So its a broken test again by mixing training data and test data.

4M_2: For the evaluation, this one used the remaining number of sentences after taking out the 4M training sentences. So since the total is about 4.4M, which is actually more like 4.36M, the test set here was only about 360k sentences as opposed to the other where it was 1.4M or 1.36M to be more accurate. But it is not mixing training and test data any more. Which is probably why it is slightly lower. But still an improvement so might as well train on the whole set at the end. The number of test tags here is 7066894 as opposed to the 24798043 in the 1.4M sentence test set.

And the resource use for training at 4M file size:

So my 32GB of RAM is plenty, and as usual it is a single core implementation..

Next I should maybe look at putting this up as some service to call over the network. Some of these taggers actually already have support for it but anyway..

A few more points I collected on the way:

For the bigger datasets it is obviously easy to run out of memory. Looking at the code for the custom tagger trainer and the full 4.4M sentence training data, I figure I could scale this pretty high in terms of sentences processed by just storing the sentences into a document database and not in memory all at once. ElasticSearch would probably do just fine as I’ve been using it for other stuff as well. Then read the sentences from the database into memory as needed. The main reason the algorithm seems to need to keep the sentences in memory is to shuffle them randomly around for new training iterations. I could just shuffle the index numbers for sentences stored in the DB and read some smaller batches for training into memory. But I guess I am fine with my tagger for now. Similarly, the algorithm uses just a single core in training for now, but could be parallelized to process each sentence separately quite easily, making it “trivially parallel”. Would need to test the impact on accuracy though. Memory use could probably go lower using various optimizations, such as hashing the keys. Probably for both CPU and memory plenty of optimizations are possibly, but maybe I will just use OpenNLP and let someone else worry about it :).

From the results of the different runs, there seems to be some consistency in LittlePOS running faster on bigger datasets, and the OpenNLP slightly slower. The Stanford tagger seems to be quite a bit slower at 50ms, but could be again due to configuration or some other issues. OpenNLP seems to get a better accuracy than my LittlePOS, and the model files are smaller. So the tradeoff in this case would be model size vs tagging speed. The tagging speed being faster with bigger datasets seems a bit odd but maybe more of the words become “unambigous” and thus can be handled with a simple lookup on a map?

Finally, in the hopes of trying the stuff out on a completely different dataset, I tried to download the Finnish datasets for Universal Dependencies and test against those. I got this idea as the Syntaxnet stats showed using these as the test and training sets. Figured maybe it would give better results across sets taken from different sources. Unfortunately Universal Dependencies had different tag sets from the FinnTreeBank I used for training, and I ran out of motivation trying to map them together. Oh well, I just needed a POS tagger and I believe I now know enough on the topic and have a good enough starting point to look at the next steps..

But enough about that. Next, I think I will look at some more items in my NLP pipeline. Get back to that later…

And add a line with “Threads=4”, it then will use 4 threads for the training phase.

In case you want to try to train on Universial Dependency 2.0, Apache OpenNLP has built in support to train on it.

For example the following can be done to do testing and eval:
bin/opennlp POSTaggerTrainer.conllu -lang fin -tagset u -model fin-pos-ud-u.bin -data UniversalDependencies20/ud-treebanks-v2.0/UD_Finnish/fi-ud-train.conllu