Tag Archives: nlp

Previously I wrote about a few experiments I ran with topic modelling. I briefly glossed over having some results for a set of Finnish text as an example of a smaller dataset. This is a deeper look into that.

I use two datasets: the Finnish Wikipedia dump and the city of Oulu board minutes. The same ones I used before. Previously I covered topic modelling more generally, so I won't go into too much detail here. To summarize, topic modelling algorithms (LDA, or Latent Dirichlet Allocation, is used here) find sets of words with different distributions over sets of documents. These word sets are then called the "topics" discussed in those documents.

This post looks at how to use topic models for a language other than English and what one might do with the results.

Lemmatize (turn words into baseforms before use) or not? I chose to lemmatize for topic modelling. This seems to be the general consensus when looking up info on topic modelling, and in my experience it just gives better results, as the same word then appears in only one form. I covered POS tagging previously, and I believe it would be useful to apply here as well, but I don't. Mostly because it is not needed to test these concepts, and I find the results good enough without adding POS tagging to the mix (which has its issues, as I discussed before). Simplicity is nice.

I used the Python Gensim package for building the topic models. As input, I used the Finnish Wikipedia text and the city of Oulu board minutes texts. I used my existing text extractor and lemmatizer for these (to get the raw text out of the HTML pages and PDF docs, and to baseform it, as discussed in my previous posts). I dumped the lemmatized raw text into files using slight modifications of my previous Java code, and then read the docs from those files as input to Gensim in a Python script.
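As a rough sketch of what that Python side looks like (the file name, filtering thresholds, and exact LdaModel arguments here are my own assumptions, not the original script):

from gensim import corpora, models

# one lemmatized document per line, tokens separated by whitespace
with open("docs_lemmatized.txt", encoding="utf-8") as f:
    docs = [line.split() for line in f if line.strip()]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare and very common words
corpus = [dictionary.doc2bow(doc) for doc in docs]

# e.g. 50 topics with a single pass over the corpus, as in the Wikipedia run described next
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=50, passes=1)

for topic_id, words in lda.show_topics(num_topics=10, num_words=10, formatted=False):
    print(topic_id, [(word, round(weight, 3)) for word, weight in words])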

I started with the Finnish Wikipedia dump, using Gensim to provide 50 topics, with 1 pass over the corpus. First 10 topics that I got:

The format of the topic list I used here is “topicX=word1[count] word2[count]”, where X is the number of the topic, word1 is the first word in the topic, word2 the second, and so on. The [count] is how many times the word was associated with the topic in different documents. Consider it the strength, weight, or whatever of the word in the topic.

topic3 = another Finnish-language-related topic. The odd one out here is "kasvi" = plant. Generally this one seems to be more about words and their forms, whereas topic1 is maybe more about structure and relations.

topic5 = Estonia related

Overall, I think this would improve given more passes over the corpus to train the model. This would give the algorithm more time and data to refine the model. I only ran it with one pass here since the training for more topics and with more passes started taking days and I did not have the resources to go there.

My guess is also that with more data and broader concepts (Wikipedia covering pretty much every topic there is..) you would also need more topics than the 50 I used here. However, I had to limit the size due to time and resource constraints. Gensim probably also has more advanced tuning options (e.g., parallel runs) that would improve the speed. So I tried a few more sizes and pass counts with the smaller Oulu city board dataset, as it was faster to run.

Some topics for the city of Oulu board minutes, run for 20 topics and 20 passes over the training data:

The word “oulu” repeats in most of the topics. This is quite natural as all the documents are from the board of the city of Oulu. Depending on the use case for the topics, it might be useful to add this word to the list of words to be removed in the pre-cleaning phase for the documents before running the topic modelling algorithm. Or it might be useful information, along with the weight of the word inside the topic. Depends.
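If a word like "oulu" should go, one simple way is to extend the stopword list used in pre-cleaning. A small sketch (the stopword set here is just an example based on the topics above):

# drop domain-specific stopwords in addition to the general Finnish stopword list
domain_stopwords = {"oulu", "kaupunki"}   # example words picked from the topics discussed here

def remove_domain_stopwords(tokens):
    return [t for t in tokens if t.lower() not in domain_stopwords]

docs = [remove_domain_stopwords(doc) for doc in docs]   # docs as built in the earlier sketch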

topic2 = School related. For example, “koulu” = school, “tukea” = support, … Sharing again common words such as “kaupunki” = city, which may also be considered for removal or not depending on the case.

These are in general quite good and focused topics, so I think it is quite a good result overall. Some exceptions to consider:

topic10 = mostly garbage related to HTML formatting and website link structures. Still a real topic of course, so nicely identified.. but something to consider adding to the cleaning list for pre-processing.

topic12 = Seems related to some city finance related consultation (Perlacon seems to be such a company) and an associated event (the forum). With a bunch of meeting dates.

topic13 = unclear garbage

So in general, I guess reasonably good results but in real applications, several iterations of fine-tuning the words, the topic modelling algorithm parameters, etc. based on the results would be very useful.

So that was the city minutes for a smaller set of topics and more passes. What does it look like for 100 topics, and how does the number of passes over the corpus affect the larger size? More passes should give the algorithm more time to refine the topics, but smaller datasets might not have that many good topics..

Without going too much into translating every word, I would say these results are too spread out, so for this dataset it seems a smaller set of topics would do better. This also seems visible in the word counts/strengths in the [square brackets]: the topics with small weights also seem like pretty poor topics, while the ones with bigger weights look better (just my opinion of course :)). Maybe something to consider when trying to explore the number of topics etc.

And the same run, this time with 20 passes over the corpus (100 topics and 10 first ones shown):

Even the smaller topics here seem much better now with the increase in passes over the corpus. So perhaps the real difference just comes from having enough passes over the data, giving the algorithms more time and data to refine the models. At least I would not try without multiple passes based on comparing the results here of 1 vs 20 passes.

For example, topic2 here has small numbers but still all items seem related to grey market economy. Similarly, topic7 has small numbers but the words are mostly related to arts and culture.

So to summarize, it seems lemmatizing your words, exploring your parameters, and ensuring you have a decent amount of data and a decent number of passes for the algorithm are all good points. As is properly cleaning your data, and iterating over the process many times to get these right (well, as "right" as you can).

To answer my “research questions” from the beginning: topic modelling for different languages and use cases for topic modelling.

First, lemmatize all your data (I prefer it over stemming, but it can be more resource intensive). Clean all your data of the typical stopwords for your language, but also for your dataset and domain. Run the models and analysis several times, and keep refining your list of removed words based on your use case, your dataset and your domain. You also likely need to consider domain-specific lemmatization rules, as I already discussed with POS tagging.

Secondly, what use cases did I find for topic modelling by looking around online? Actually, it seems really hard to find concrete reports of actual uses for topic models. Quora has usually been promising but not so much this time. So I looked at published research papers instead, trying to see if any companies were involved as well.

Some potential use cases from research papers:

Bug localization, as in finding locations of bugs in source code is investigated here. Source code (comments, source code identifiers, etc) is modelled as topics, which are mapped to a query created from a bug report.

Matching duplicate documents. Topic distributions over bug reports are used to suggest duplicate bug reports: not exact duplicates, but reports describing the same bug. If the topic distributions are close, flag them as potentially discussing the same "topic" (bug).

Ericsson has used topic models to map incoming bug reports to specific components, to make resolving bugs easier and faster by automatically assigning them to the (correct) teams for resolution. Large historical datasets of bug reports and their assignments to components are used to learn the topic models. The topic distribution of an incoming bug report is compared to the topic distributions of previous bug reports for each component, giving a probability ranking for which component the bug report describes. Topic distributions are also used as explanatory data to present to the expert looking at the classification results. Later, different approaches have been reported at Ericsson as well. So just a reminder that topic models are not the answer to everything, even if they are useful components and worth a try in places.

In cyber security, this uses topic models to describe user activity as distributions over the different topics. Learn topic models from user activity logs, and describe each user's typical activity as a topic distribution. If a log entry (e.g., a session?) diverges too much from this topic distribution for the user, flag it as an anomaly to investigate. I would expect simpler things could work for this as well, but as input for anomaly detection, an interesting thought.

Tweet analysis is popular in NLP. This is an example of high-level tweet topic classification: politics, sports, science, … Useful input for recommendations etc., I am sure. A more targeted, domain-specific example is the use of topics in typhoon-related tweet analysis and classification: worried, damage, food, rescue operations, flood, … Useful input for situation awareness, I would expect. As far as I understood, topic models were generated, labeled, and then users (or tweets) assigned to the (high-level) topics by topic distribution. Tweets are very small documents, so that is something to consider, as discussed in those papers.

Use of topic models in biomedicine for text analysis. To find patterns (topic distributions) in papers discussing specific genes, for example. This could work more broadly as one tool to explore research in an area, to find clusters of concepts in broad sets of research papers on a specific "topic" (here, research on a specific gene). Of course, there likely exist a number of other techniques to investigate for that as well, but topic models could have potential.

More generally, labelling and categorizing large numbers of historical/archival documents to assist users in search. Build topic models, have experts review them and give the topics labels, then label your documents based on their topic distributions.

A bit further outside the box: split songs into segments based on their acoustic properties, and use topic modelling to identify different categories/types of music in large song databases. Then explore the popularity of such categories/types over time based on topic distributions over time. So the segments are your words, and the songs are your documents.

Finding duplicates of images in large datasets. Use image features as words, and images as documents. Build topic models from all the images, and find similar types of images by their topic distributions. Features could be edges, or even abstract ones such as those learned by something like a convolutional neural net. Assists in image search, I guess..

Most of these uses seem to be various types of search assistance, with a few odd ones thinking outside the box. With a decent understanding, and some exploration, I think topic models can be useful in many places. The academics would say "dude, XYZ would work just as well". Sure, but if it does the job for me, and is simple and easy to apply..

To get a better view of the popular Word2Vec algorithm and its applications in different contexts, I ran experiments with Word2vec on the Finnish language. Let's see.

I used two datasets. The first one is the traditional Wikipedia dump; I got the Finnish-language dump from October 20th, because I ran the first experiments around that time. The second dataset was the board minutes for the City of Oulu for the past few years.

After running my cleaning code on the Wikipedia dump, it reported 600783 sentences and 6778245 words for the cleaned dump. Cleaning here refers to removing all the extra formatting, HTML tagging, etc. Sentences were tokenized using Voikko. For the board minutes the corresponding metrics were 4582 documents, 358711 sentences, and 986523 words. Most interesting, yes?

For running Word2vec I used the Deeplearning4J implementation. You can find the example code I used on Github.

Again I have this question of whether to use lemmatization or not. Do I run the algorithm on baseformed words or just unprocessed words in different forms?

Some prefer to run it after lemmatization, while the articles on word2vec generally say nothing on the topic but rather seem to run it on raw text. This description of a similar algorithm actually shows an example of mapping "frog" to "frogs", further indicating use of raw text. I guess if you have a really large amount of data, and a language that does not have a huge number of forms for different words, that makes more sense. Or if you find relations between forms of words more interesting.

For me, Finnish has so many forms of words (morphologies, or whatever they should be called?) and I generally don't expect to run with hundreds of billions of words of data, so I tried both ways (with and without lemmatization) to see. With my limited data and the properties of the Finnish language I would just go with lemmatization, really, but it is always interesting to try and see.
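The actual runs used the Deeplearning4j implementation linked above; as a rough illustration of the same pipeline, a Gensim version (a tool this post also mentions towards the end) might look like the following. The file name and hyperparameters are my assumptions, not the settings of the real runs:

from gensim.models import Word2Vec

# one (optionally lemmatized) sentence per line, tokens separated by whitespace
with open("fi_wiki_sentences.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# closest words to "auto" (car), with similarity scores
for word, score in model.wv.most_similar("auto", topn=10):
    print("auto vs", word, "=", score)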

Some results for my experiments:

Wikipedia without lemmatization, looking for the closest words to “auto”, which is Finnish for “car”. Top 10 results along with similarity score:

auto vs kuorma = 0.6297630071640015

auto vs akselin = 0.5929439067840576

auto vs auton = 0.5811734199523926

auto vs bussi = 0.5807990431785583

auto vs rekka = 0.578578531742096

auto vs linja = 0.5748337507247925

auto vs työ = 0.562477171421051

auto vs autonkuljettaja = 0.5613142848014832

auto vs rekkajono = 0.5595266222953796

auto vs moottorin = 0.5471497774124146

Words from above translated:

kuorma = load

akselin = axle’s

auton = car’s

bussi = bus

rekka = truck

linja = line

työ = work

autonkuljettaja = car driver

rekkajono = truck queue

moottorin = engine’s

A similarity score of 1 would mean a perfect match, while values near 0 mean essentially no relation. Word2vec builds a model representing the positions of words in a "vector space", inferred from "word embeddings". This sounds fancy, and as usual, it is difficult to find a simple explanation of what is done. I view it as taking typically 100-300 numbers to represent each word's position in this "word space". These get adjusted by the algorithm as it goes through all the sentences and records each word's relation to the other words in those sentences. Probably all wrong as explanations go, but until someone gives a better one..
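For reference, the similarity score reported by these tools is typically the cosine similarity between the two word vectors, i.e. the dot product divided by the product of the vector lengths. A tiny sketch:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = vectors point the same way, values near 0 = no particular relation
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. with the Gensim model from the earlier sketch:
# cosine_similarity(model.wv["auto"], model.wv["bussi"])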

To preprocess the documents for word2vec, I split the documents into sentences to give the words a more meaningful context (a sentence vs. just any surrounding words). There are other similar techniques, such as GloVe, that may work better with a more global "context" than a sentence. But this time I was playing with Word2vec, which I think is also interesting for many things, and which has lots of implementations and popularity.

Looking at the results above, there is the word "auton", translating to "car's". The Finnish language has a large number of forms that different words can take. So, sometimes, it may be better to lemmatize, so that the results map to word meanings rather than to particular forms of words. So I lemmatize with Voikko, the Finnish-language lemmatizer, again. Re-run of the above, top 10:

auto vs ajoneuvo = 0.7123048901557922

auto vs juna = 0.6993820667266846

auto vs rekka = 0.6949941515922546

auto vs ajaa = 0.6905277967453003

auto vs matkustaja = 0.6886627674102783

auto vs tarkoitettu = 0.66249680519104

auto vs rakennettu = 0.6570218801498413

auto vs kuljetus = 0.6499230861663818

auto vs rakennus = 0.6315782070159912

auto vs alus = 0.6273047924041748

Meanings of the words in English:

ajoneuvo = vehicle

juna = train

rekka = truck

ajaa = drive

matkustaja = passenger

tarkoitettu = meant

rakennettu = built

kuljetus = transport

rakennus = building

alus = ship

So generally these mappings make some sense. Not sure about those building words. Some deeper exploration would probably help..

Some people also came up with the idea of POS tagging before running word2vec, calling it Sense2Vec and so on. The point is to better differentiate how different meanings of a word map differently. So I tried POS tagging with the tagger I implemented before. Results:

auto_N vs juna_N = 0.7195479869842529

auto_N vs ajoneuvo_N = 0.6762610077857971

auto_N vs alus_N = 0.6689988970756531

auto_N vs kone_N = 0.6615594029426575

auto_N vs kuorma_N = 0.6477057933807373

auto_N vs tie_N = 0.6470917463302612

auto_N vs seinä_N = 0.6453390717506409

auto_N vs kuljettaja_N = 0.6449363827705383

auto_N vs matka_N = 0.6337422728538513

auto_N vs pää_N = 0.6313328146934509

Meanings of the words in English:

juna = train

ajoneuvo = vehicle

alus = ship

kone = machine

kuorma = load

tie = road

seinä = wall

kuljettaja = driver

matka = trip

pää = head

So… The weirdest ones here are the wall and head parts. Perhaps again a deeper exploration would tell more. The rest seem to make some sense just by looking.

And to do the same for the City of Oulu Board minutes. Now looking for a specific word for the domain. The word being “serviisi”, which is the city office responsible for food production for different facilities and schools. This time lemmatization was applied for all results. Results:

serviisi vs tietotekniikka = 0.7979459762573242

serviisi vs työterveys = 0.7201094031333923

serviisi vs pelastusliikelaitos = 0.6803742051124573

serviisi vs kehittämisvisio = 0.678106427192688

serviisi vs liikel = 0.6737961769104004

serviisi vs jätehuolto = 0.6682301163673401

serviisi vs serviisin = 0.6641604900360107

serviisi vs konttori = 0.6479293704032898

serviisi vs efekto = 0.6455909013748169

serviisi vs atksla = 0.6436249017715454

Because "serviisi" is a very domain-specific word/name here, the general-purpose Finnish lemmatization does not work for it. This is why "serviisin" appears again. To fix this, I added this and some other basic forms of the word to a list of custom spellings recognized by my lemmatizer tool. That is, it uses Voikko, but if a word is not found it tries a lookup in the custom list, and if still not found, it writes the word to a list of all unrecognized words sorted by highest frequency first (to allow augmenting the custom list more effectively).
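The lemmatizer tool itself is the Java one from my earlier posts; the fallback logic is roughly the following (a Python-flavoured sketch, with the names and the custom-spellings format invented for illustration):

from collections import Counter

custom_baseforms = {"serviisin": "serviisi", "serviisille": "serviisi"}   # hand-maintained list
unrecognized = Counter()

def baseform(word, voikko):
    analyses = voikko.analyze(word)            # libvoikko analysis for the word
    if analyses:
        return analyses[0]["BASEFORM"].lower() # take the first baseform offered
    if word.lower() in custom_baseforms:       # fall back to the custom spelling list
        return custom_baseforms[word.lower()]
    unrecognized[word.lower()] += 1            # log misses for later review
    return word.lower()

# after a run, dump unrecognized.most_common() to a file, highest frequency first,
# and use it to grow the custom list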

Results after change:

serviisi vs tietotekniikka = 0.8719592094421387

serviisi vs työterveys = 0.7782909870147705

serviisi vs johtokunta = 0.695137619972229

serviisi vs liikelaitos = 0.6921887397766113

serviisi vs 19.6.213 = 0.6853622794151306

serviisi vs tilakeskus = 0.673351526260376

serviisi vs jätehuolto = 0.6718368530273438

serviisi vs pelastusliikelaitos = 0.6589146852493286

serviisi vs oulu-koilismaan = 0.6495324969291687

serviisi vs bid=2300 = 0.6414187550544739

Or another run:

serviisi vs tietotekniikka = 0.864517867565155

serviisi vs työterveys = 0.7482070326805115

serviisi vs pelastusliikelaitos = 0.7050554156303406

serviisi vs liikelaitos = 0.6591876149177551

serviisi vs oulu-koillismaa = 0.6580390334129333

serviisi vs bid=2300 = 0.6545186638832092

serviisi vs bid=2379 = 0.6458192467689514

serviisi vs johtokunta = 0.6431671380996704

serviisi vs rakennusomaisuus = 0.6401894092559814

serviisi vs tilakeskus = 0.6375274062156677

So what are all these?

tietotekniikka = city office for ICT

työterveys = occupational health services

liikelaitos = company

johtokunta = board (of directors)

konttori = office

tilakeskus = facilities (premises) office

pelastusliikelaitos = emergency office

energia = energy

oulu-koilismaan = name of area surrounding the city

bid=2300 is an identifier for one of the Serviisi board meeting minutes main pages.

19.6.213 seems to be a typoed date and could at least be found in one of the documents listing decisions by different city boards.

So almost all of the words that "serviisi" is found to be closest to are other city offices/companies responsible for different aspects of the city, such as ICT, energy, office space, emergency response, or occupational health. Makes sense.

OK, so much for the experimental runs. I should summarize something about this.

The Wikipedia results seem slightly better in terms of the suggested words being valid words. For the city board minutes I should probably filter more based on the presence of special characters and numbers. Maybe this is a difference between larger and smaller datasets, where the "garbage" more easily drowns in the larger sea of data. Don't know.

The word2vec algorithm also has a set of parameters to tune, which would probably be worth more investigation to get more optimized results for these different types of datasets. I simply used the same settings for both the city minutes and Wikipedia. Yet due to the size differences, it would likely be interesting to play at least with the size of the vector space. For example, bigger datasets might benefit from a bigger vector space, which should enable them to express richer relations between different words. For smaller sets, a smaller space might be better. Similarly, the number of processing iterations, minimum word frequencies, etc. should be tried a bit more. For me the goal here was to get a general idea of how this works and how to use it with Finnish datasets. For that, these experiments are enough.

If you read up on any articles about Word2Vec you will likely also see the hype around its ability to do equations such as "king – man + woman" = "queen". These come from training on large English corpora. It simply says that the relation of the word "queen" to the word "woman" in sentences is typically the same as the relation of the word "king" to "man". But this is often the only example, or one of very few examples, ever given. Looking at the city minutes example here: since "serviisi" seems to map closest to all the other offices/companies of the city, what do we get if we run the arithmetic "serviisi – liikelaitos" (so liikelaitos would be the common concept of the office/company)? I got things like "city traffic", "reduce", "children home", "citizen specific", "greenhouse gas". Not really useful. So this seems most useful as a potential tool for exploration, but I cannot really say which part gives useful results when. But of course, it is nicer to report on the interesting abstractions it finds than on the boring fails.
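For reference, this kind of vector arithmetic is a one-liner in most implementations; in Gensim syntax (the actual runs used Deeplearning4j, and the words are assumed lemmatized as above) the "serviisi – liikelaitos" query would look roughly like this:

# what is left of "serviisi" when the generic office/company concept is subtracted
for word, score in model.wv.most_similar(positive=["serviisi"], negative=["liikelaitos"], topn=5):
    print(word, score)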

I think lemmatization makes sense in the cases I showed here. I have no interest in just knowing that a singular form of a word is related to the plural form of the same word. But I guess in some use cases that could be valid. Of course, for proper lemmatization you might also wish to first do POS tagging, to be able to choose the correct baseform from all the options presented. In this case I just took the first baseform from the list Voikko gives for each word.

Tokenization could also be of more interest. Finnish has a lot of compound words, some of which are visible in the above examples. For example, "kuorma-auto" and "linja-auto" in the Wikipedia example, or the different "liikelaitos" combinations in the city of Oulu version. N-grams (combinations of words) would also be useful to investigate. For example, "energia" in the city example could easily be related to the city power company called "Oulun Energia". Many similar examples can likely be found all over any language and domain vocabulary.

Further custom spelling corrections would also be useful. For example, "oulu-koilismaan" above should be spelled "oulu-koillismaan", and it could further be baseformed together with its other forms as "oulu-koillismaa". Collecting these from the unrecognized words, and filtering out the low-frequency occurrences, should make this relatively easy.

So, perhaps the most interesting question: what is this good for?

Not synonym search. Somehow over time I had gotten the idea that word2vec could give you some kind of synonyms and such. Clearly it is not for that, but rather for identifying words related to similar concepts and the like.

So generally I can see it could be useful for exploring related concepts in documents. Or generally for exploring datasets and building concept maps, search definitions, etc. More as an input to human expert work rather than as something fully automated, as the results vary quite a bit.

Spotify mapping similar songs together via treating songs as words and playlists as sentences.

Someone tried it on sentiment analysis. Not really sure how useful that was as I just skimmed the article but in general I can see how it could be useful to find different types of words related to sentiments. As before, not necessarily as automated input but rather as input to an expert to build more detailed models.

Using the similarity score weights as a means to find different topics. Maybe you could combine this with topic modelling and then look at the diversity of topics?

Product recommendations, using products as words and sequences of purchases as sentences. Not sure how much the order of purchases matters, but an interesting idea.

Bet recommendations, modelling bets made by users with bet targets as words and sequences of bets as sentences, and finding similarities with other bets to recommend.

So that was mostly that. Similar tools exist for many platforms, whatever gives you the kicks. For example, Voikko has a Python module on GitHub, and Gensim is a nice tool for many NLP processing tasks, including Word2Vec in Python.

There are also lots of datasets, especially for the English language, to use as pretrained word2vec models. For example, Facebook's FastText, Stanford's GloVe datasets, or the Google News corpus. Some simple internet searches should turn up many such models, which I think is useful for general-purpose results. For more detailed, domain-specific ones, training your own is good, as I did here for the city minutes..

Many tools can also take in word vector models built with some other tool. For example, deeplearning4j mentions import of GloVe models, and Gensim lists support for FastText, VarEmbed and WordRank. So once you have a good idea of what such models can do and how to use them, building combinations of these is probably not too hard.

Previously I wrote about building a Finnish POS tagger. This post elaborates a bit on training with OpenNLP, which I skimmed over last time, puts the code for it out, and adds some additional tests.

I am again using the Finnish Treebank to get 4.4M pre-tagged sentences to train on. I start with a Python script to transform the Treebank XML into a format suitable for OpenNLP. A short example of the output is below, in the format OpenNLP takes as input (at least in the configuration I used): one line contains one sentence, each word with its associated POS tag, word and tag separated by an underscore "_".
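I am not reproducing the original XML-parsing script here, but as a sketch of the same conversion idea using the CoNLL-X distribution of the Treebank (the ftb3.1.conllx file mentioned later in this post), where the word form and coarse POS tag sit in fixed tab-separated columns:

def conllx_to_opennlp(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        sentence = []
        for line in fin:
            line = line.strip()
            if not line:                        # a blank line ends a sentence
                if sentence:
                    fout.write(" ".join(sentence) + "\n")
                    sentence = []
                continue
            cols = line.split("\t")
            word, tag = cols[1], cols[3]        # FORM and coarse POS columns in CoNLL-X
            sentence.append(word + "_" + tag)   # OpenNLP expects word_tag tokens
        if sentence:
            fout.write(" ".join(sentence) + "\n")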

Previously I described the test results using the Treebank data with a train/test split, showing reasonably good results. However, how well does it work in practice with some simple test sentences? Does it matter how the training and tagger input data is pre-processed? What do I mean by pre-processed?

Stemming and lemmatization are two basic transformations that are often used in NLP. Stemming cuts the ending of a word to get a simple version that matches all the different forms of the word. The result is not always a real "word". For example, "argue", "arguing", and "argus" could all stem to "argu". Lemmatization, on the other hand, produces more "real" words (the Wikipedia link describes it as producing the dictionary base forms).

A related question that came to my mind: does it matter if you stem/lemmatize the words you give as input to the tagger for training and testing? I could not find a good answer on Google. There is one question on Stack Overflow about stemming vs. POS tagging, and the response seems to be not an answer but riddles… Who would've guessed, about the machine learning community? 😛

Well, reading the discussion and other answers on the Stack Overflow page seems to suggest not stemming before POS tagging. And the Wikipedia pages on stemming and lemmatization describe the difference as lemmatization requiring the context (the POS tag) to properly function. Which makes sense, since words can have multiple meanings depending on their context (part of speech). So we should probably conclude that it is better not to stem or lemmatize before training a POS tagger (or using it, I guess). But common sense never stopped us before, so let's try it.

To see for myself, I tried to train and use the tagger with some different configurations:

Tagger: Plain = Takes words in the sentence and tries to POS tag them as is. Not stemmed, not lemmatized, just as they are.

Tagger: Voikko = Takes words in the sentence, converts them to baseform (lemma?), reconstructs the sentence from the baseformed words. You can see the actual results and effect in the output column in the results table below.

Trained on: 100k = The tagger was trained on the first 100k sentences in the Finnish Treebank.

Trained on: 4M = The tagger was trained on the first 4M sentences in the Finnish Treebank.

Trained on: basecol = The tagger was trained on baseform column of the treebank.

Trained on: col1 = The tagger was trained on column 1 of the treebank, containing the unprocessed words (no baseforming or anything else).

Trained on: voikko = The tagger was trained on column 1 of the treebank, but before training all words in the sentence were baseformed using Voikko. Similar to “Tagger: Voikko” but for training data.

Input: The input sentence fed to the tagger. This was split to an array on whitespace, as the OpenNLP tagger takes an array of words for sentence as input.

Output: The output from the tagger, formatted as word_tag. Word = the word given to the tagger as input for that part of the sentence, tag = the POS tag assigned by the tagger for that word.

So the Treebank actually has a "baseform" column, described in the Treebank docs as having the baseform of each word. However, I do not have the tool that was used to baseform the words for the Treebank. Maybe it was done manually by the people who also tagged the sentences. Don't know. I use Voikko as the tool to baseform words.

I still wanted to try using the baseform column in the Treebank, so I ran all the words (baseform col and col1) in the Treebank through Voikko to see if it would recognize them. I recorded all the misses and sorted them from highest occurrence count to lowest. This showed me that the Treebank has its own "oddities". Some examples:

“merkittävä” becomes “merkittää”

“päivästä” becomes “päivänen”

“työpaikkoja” becomes “työ#paikko”

These are just a few examples of highly occurring and odd looking baseforms in the Treebank. None of these, in my opinion, map quite directly to understandable Finnish words. And Voikko provides different results (gives different baseform for “merkittävä”, “päivästä”, etc), so the two baseforming approaches would not match. I wanted results that I felt I could show to people who would understand what they meant. On the other hand, some of the words in the Treebank are quite domain-specific and valid but Voikko does not recognize them. Common Treebank examples of this include “CN-koodeihin”, “CN-koodiin”, “ETY-tyyppihyväksynnän”, “ETY-tyyppihyväksyntään”, “läsnäollessa”. Treebank has valid baseforms for these but Voikko does not recognize these specific ones.

So I just tried it with the different configuration versions above, as illustrated in the results table below:

You can find all the POS tags etc. listed and explained in the Treebank Manual. Here are most of the above:

N = Noun

V = Verb

PrfPrc = Past participle

A = Adjective

CS = Subordinating conjunction

Abbr = Abbreviation

Num = Numeral

Punct = Punctuation

Adv = Adverb

Unkwn = Unknown

Some of these (CS, PrfPrc, Adv, …) are a bit more detailed than I ever want to get after leaving primary school 100 years ago. That is to say, I have no idea what they mean. Luckily I am really only interested in the POS tags as input to other algorithms, so I don't really care what they are as long as they are correct and help to differentiate the words in context. Of course, with my lack of knowledge of the language nuances and the academic details of all those tags, I am not very good at judging the correctness of the taggings above. But a few notes anyway:

Using the baseform column from the Treebank to train the tagger and to tag unprocessed sentences (tagger “plain”): Lots of unknowns and failed taggings in general. Size of training corpus makes little difference.

Using Treebank col 1 to train and the “plain” tagger gives better results. Still it has some issues but most general cases are not too bad.

Baseforming all words in the sentence to be tagged with Voikko (tagger “Voikko”) and using col 1 to train results in about similar performance as “plain” tagger with col 1.

Tagger “Voikko” with training type “voikko” and 4M sentences seems to give the best match. It has some issues though.

Baseforming the sentence to tag with Voikko has a chicken-and-egg problem (as mentioned in the Wikipedia links far above). You can get multiple baseforms for a word, depending on what POS the word is. If you need to define this to do POS tagging, then how do you pick which one to use? For example, "keksiä" in Finnish refers to inventing but could also be a form of "cookie". Here, I just used the first baseform of a word given by Voikko, which for "keksiä" happens to be the one for "cookie", when the correct one in this case would be the "inventing" one..

As there are two different baseforming approaches here (Voikko and Treebank baseform col), mixing them causes worse results than using a unified baseforming approach (Voikko for both training and later tagging). So better to stick with just the same baseformer/lemmatizer for all data.

Special elements such as smileys would need to be trained separately :). Here they are just treated as punctuation.

“Jaffa” is a Finnish drink. It gets classified here correctly as N but also as numerical, punctuation, or verb. Maybe too rare a word or something? Numerical and punctuation are still odd.

Splitting on whitespace here causes issues with sentences ending in punctuation. The last words of sentences ending with ".", "?", or such, end up classified as "Punct". Better splitting (tokenization) is needed. Since punctuation is also trained into the tagger, it should not just be discarded though, as I guess it can provide valuable context for the rest of the words.

Some of my test sentences I made up to be difficult to POS tag, and with very limited sentences above, this is likely not a generally representative case. For example, “Tuli tuli” can be translated as “Fire came” (intent here), “Fire fire”, “It came it came”, and probably valid taggings would also be “N V N”, “V N N”, “N N N”, “V V V”. Some of it might even be difficult for humans without broader context, although the “tulipesä” (fireplace) would likely tip people off. Similarly “voi” could also be translated as “butter” (intent here) or “could”.

Much bigger tests would be very useful to categorize what can be tagged right, what causes issues, etc.

It would also be useful to have a system available to choose whether the sentence was tagged right or not, and to retrain further the tagger with the errors. Maybe use a generator to build further examples of such errors.

So I guess the better configurations here can do a reasonable job of tagging most sentences, as illustrated by these results and the ones I listed before (the accuracy test on Treebank test/train split).

I am not so familiar with all the other works, such as Google's Parsey McParseface. Because you know, it's deep learning and that is all the rage, right? 🙂 It would be interesting to try, but the whole setup is more than I can do right now.

Better tuning of the OpenNLP parameters might also help if I had more expertise on that, and on its mapping to Finnish-language peculiarities. In general, I am sure I am missing plenty of magic tricks the NLP gurus could use.

In general, I guess it is most likely just better to train the tagger on text before lemmatization/baseforming, as noted above.

What more can I summarize here? Not much, not further than the bullets and points above. But this may provide a useful starting point for those interested in POS tagging for Finnish. Possibly useful points for some other languages as well..

I have previously done some topic modelling using LDA (Latent Dirichlet Allocation). Back then I learned it from a nice video by some nice guy, but somehow I cannot find the video with search engines anymore. Too bad. I implemented LDA in Java back then based on that tutorial. I learned how it works, not why it works. I still don't quite get why the set of topics emerges from the algorithm.

Actually I found a reasonably good explanation on Quora. Well, it is a good one if you already know most of how LDA works. Eh. Also a tutorial briefly summarizing how online LDA works, which is a nice improvement, and I guess what the tools use these days.

The number of topics LDA produces is given as a parameter, and it is always a bit of a puzzle for me how to pick the best number of topics. Googling for it, I found various references to using "perplexity" to choose the best number of topics. I still have not found a good "for dummies" explanation of what that really means in practice for LDA, or how to implement it. Maybe some of the libs out there will do it for me? Python seems to be all the rage in data science these days, because whatever. So after a few searches, Gensim it is.

Gensim seems to have some perplexity options and a bunch of weird formulas to apply. Is it so hard to write some simple docs and explain these things? I guess nobody pays people to do it, and doing for free would just go against the goal of making oneself important. Sort of makes sense, and applies to most OSS software I have used. Or maybe I am just bad at using stuff.

Anyway. There is also something called topic coherence in Gensim. This is supposed to be some way to evaluate the number of topics. Somehow the explanation does not work for me; I did not quite grasp how it works for real. So I just gave it a try to see what I get, which is what matters most for me regardless.

I start with the English wikipedia (I used a May 2017 dump). Because it is sorta big and I can put the results here, everyone knows it and it’s public data. Gensim nicely comes with a script to parse it for dictionary and corpus:

python -m gensim.scripts.make_wiki

Then some code to build different sizes of topic models (25 to 200 topics, in 25-topic increments):
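The original snippet is not preserved in this archive, so here is a reconstructed sketch based on the description: the topic sizes, chunksize=10000, and the default/autotuned split follow the text, while the file names (from the make_wiki output prefix) and the directory layout are my own guesses.

import os
from gensim import corpora, models

dictionary = corpora.Dictionary.load_from_text("wiki_en_wordids.txt.bz2")
corpus = corpora.MmCorpus("wiki_en_bow.mm")     # outputs of gensim.scripts.make_wiki

topic_counts = [25, 50, 75, 100, 125, 150, 175, 200]   # reconstructed from "25 to 200 in 25 topic increments"
variants = [("default", {}), ("auto", {"alpha": "auto", "eta": "auto"})]

for num_topics in topic_counts:
    for name, params in variants:
        out_dir = "lda_%s_%d" % (name, num_topics)
        os.makedirs(out_dir, exist_ok=True)
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                              chunksize=10000, **params)
        lda.save(os.path.join(out_dir, "lda.model"))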

The code above drops a set of 9 different-sized topic models into matching directories, both for default parameters and autotuned parameters. It takes a while to run. The machine I ran it on has 32GB RAM and a quad-core Core i7 processor (hyperthreaded to 8 virtual cores). Resource use? I actually found the Gensim implementations quite nicely optimized: they do not take huge amounts of memory, and they pretty much make use of all the cores in a system. Except perhaps the topic coherence ones, which still seemed to run on a single core. Perhaps because they are relatively new?

My first mistake in this regard was to think of LDA as a single-core solution. I implemented the original algorithm some time back, and did not see it becoming anything else. But the online version seems to batch the data into pieces, which I guess makes it more parallelizable. And the Gensim docs also nicely describe how this online algorithm merges the results in a way that means you don't necessarily need to run large numbers of passes (iterations) over the corpus to converge on a good model. Chunksize 10000 in the above code causes this merge after every 10000 docs, and with Wikipedia having about 4 million articles, this amounts to quite a few merges. Maybe somewhat equal to the iterations of old.

With logging enabled, Gensim prints some text about the "topic diff" between each batch and merge. This seems to indicate how much the topic model changed between the runs. So I plotted the topic diff for the Wikipedia run (when generating the LDA models), to see how much the topics drift during the run. See the figure below for the 9 sizes I used, using Gensim's default LDA parameters:

And for using the autotuned parameters:

From this, it seems the topic model actually pretty much "converges" quite early in the process. That is, the topic diff goes down to a small number and the topics become quite stable across merges/iterations. Maybe because there is so much data in this dataset? The autotuned version also seems to converge much more directly. So I will use that later.

After this, I ran the same analysis on a bunch of document sets I have from different Finnish organizations. I won't be putting the exact data for those documents online here, but I will show some statistics on the runs and the models produced, as well as my impressions from looking at the topics generated and the stats. Some stats when running the autotuned version (because the autotuned version seemed to converge faster, and about equally well in quality, on Wikipedia):

type id    doc count
1          3651
2          1930
3          679
4          5596
5          1058
6          343
7          228
8          1069
9          333
10         213
11         279
12         316
13         592
14         397
15         104
16         1076
17         1648

Since these have a very small number of documents compared to Wikipedia, I ran the Gensim LDA model generator for them in online mode using a batch size of 1000. Separately with 10 iterations and 100 iterations, to get some comparable data on the impact of iteration counts. Listing all the 3×3 grids for the 17 document sets would be a bit much to show here. So after looking at them, I figured they were mostly similar, with maybe a few minor differences. So I picked three types (based on my feelings when looking at the figures):

Type 1 (this grid is for doc set with type id 6 from above):
10 iterations:

100 iterations:

Type 2 (this grid is for doc set with type id 5 from above):
10 iterations:

100 iterations:

Type 3 (this grid is for doc set with type id 7 from above):
10 iterations:

100 iterations:

Remember, the types are just something I made up myself. I chose Type 1 to refer to models where there was a big difference from 10 iterations to 100 iterations in the final topic diff for the 25 topic run. In the example Type 1 figures here (for doc type 6), the 10 iteration run gets to around 0.25 final diff. In my set for type 1, document sets 2, 16, and 17 had the biggest diff of about 0.5 in the end after 10 iterations. Document sets 3, 6, 9, 12, 13, and 14 were close to 0.2 diff after 10 iterations. Document sets 10 and 11 were close to 0.1 diff for 10 iterations. Each of these was close to 0 final diff after 100 iterations.

Type 2 refers to models where the 25 topics line has a noticeable “jiggly” effect to it. Maybe this is between the iterations (or “passes”)? Not sure how Gensim restarts iterations, so could have something to do with it. Topics for document sets 5 and 8 had the biggest such effects, as also shown in the Type 2 figure above for document set 5. For document sets 1 and 4, the effect was smaller but still seemed to be there.

Type 3 refers to models where there was no big difference in final topic diff in 10 vs 100 iterations. This was just the models for document sets 7 and 15. These are also the two smallest document sets (least docs). Maybe smaller sets converge better with fewer iterations?

Looking at the document count table above, there is no clear correlation between document count and the types of figures (1, 2, 3) I used above. There could be other differences in the properties of the documents (e.g., length, number of real distinct topics embedded in each). It is not in my scope to investigate further, but the reasons could be anything, what do I know.

The properties I used to select the types are mostly visible in the smaller number of topics. With higher number of topics they all seem quite similar. Maybe the algorithm has to work harder to fit the data into fewer topics? Or maybe I just have so little data there that larger number of topics always produces garbage topics uniformly? No idea, really.

And once the models are built, the Gensim coherence estimator can be run to evaluate which of these is best according to Gensim. I used the u_mass measure here, since it does not require the corpus to be reloaded. According to this website, others such as c_v are more accurate while u_mass is faster. For my experiments I am just looking for a general feel for the usefulness of the coherence measure. If I had more motivation and resources I might try the others as well. Mostly resources, since my results are not too good and further exploration would be needed to make them better. But let's not jump too far. Code:
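(The original snippet is not preserved here; this is a sketch of the same u_mass evaluation, reusing the model and file naming from the earlier reconstructed sketch.)

from gensim import corpora, models
from gensim.models import CoherenceModel

dictionary = corpora.Dictionary.load_from_text("wiki_en_wordids.txt.bz2")
corpus = corpora.MmCorpus("wiki_en_bow.mm")

for num_topics in [25, 50, 75, 100, 125, 150, 175, 200]:
    lda = models.LdaModel.load("lda_auto_%d/lda.model" % num_topics)
    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass")
    print(num_topics, cm.get_coherence())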

I have to say, I am maybe not very excited. Mostly the topics make at least some sense, but many of the coherence measures show higher values for bigger topic counts. For example, the 100-iteration coherence for document sets 7 and 15 suggests a set of topics around 150 would be great. Doc set 15 even has fewer documents than that. Manually looking at the generated topics, a large number of them are actually almost the same topics. They have mostly the same words, and very low weights for the topics/words, meaning very few words in the docs got assigned to those topics. So it would seem that for most purposes, a lower number of topics is better for these document sets. Unless maybe you want to capture really fine-grained differences between topics. Not sure what that would be good for, but maybe it has some use cases.

So if a smaller number of topics would be better, maybe I need to try even smaller numbers of topics. That seems reasonable given the smallish number of documents I have. Like 5, 10, 15, or 20 topics. See where that takes me. Here we go:

[Coherence figures (autotuned parameters, 100 iterations) for doc sets 1-17.]

Comparing these figures with the earlier ones for topic counts 25-200, the lower numbers of topics generally scored better here. Just for a quick comparison, most of these 2-20 sizes have their highest score close to -0.5 to -0.7, while the best scores for 25-200 were closer to -1.0. The exception is again doc set 15, which trolls us again with a value close to -0.8 at 3 and 150 topics. Eh.

For a final comparison, and to see what I think of the topics found at different sizes, I simply manually examined the topics by printing them to files like so:
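(Again a sketch rather than the original code; the file naming and the number of words per topic are illustrative.)

def dump_topics(lda, out_path, topn=20):
    with open(out_path, "w", encoding="utf-8") as out:
        for topic_id in range(lda.num_topics):
            words = lda.show_topic(topic_id, topn=topn)
            line = " ".join("%s[%.3f]" % (word, weight) for word, weight in words)
            out.write("topic%d = %s\n" % (topic_id, line))

# e.g. dump_topics(lda, "topics_25.txt")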

After dumping all my doc sets (1-17) like this, and looking at the ones with the highest/lowest coherence values, I could not really say that the topics were in any way better for the highest coherence values. Certainly for these small document sets, the smaller topic counts were better if looking for clearly distinct topics. Which I think is what most people would look for. So I am sure there is some value here, and trying the more accurate coherence metrics such as c_v (as discussed at the beginning of this post) would probably give better results. Maybe someday.

Alternatively, for a more visual exploration, there is also the option to use the LDAvis package. Wikipedia example:
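(A sketch with the pyLDAvis package; the module was called pyLDAvis.gensim in older versions and gensim_models in newer ones, and the file names here are illustrative.)

import pyLDAvis
import pyLDAvis.gensim_models   # "pyLDAvis.gensim" in older versions

vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "ldavis_wiki_25.html")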

This dumps the whole LDAvis thing into an HTML file you can then load up any time later and play with. The nice thing about this is that it can be run on a headless remote server, and produces a single HTML file (a bit large, but anyway). This HTML file can then be downloaded and opened as a local file. So no webserver is needed anywhere, and the interactive visualization can be shared as a single file.

How does it look? To continue avoiding dumping the Finnish datasets here, I use examples for 25, 100 and 200 topics from Wikipedia:

25:

100:

200:

The first (and biggest) topic in the list of 25 is related to movies. Same for the 100 topics. In the 200-topic model, music takes the first spot, the second is about novels (books), the third football, and movies come fourth.

In the LDAvis figure here for 25 topics, the cluster of four smaller topics on the right is related to Asian countries. In the topic word list below for 25 topics, these are topics 4, 14, 16, and 20. The numbering is just different because they are ordered differently. The LDAvis figure above for 200 topics also has a cluster of small ones on the left, with many of those for countries/states but also some for other topics such as chess, church, weightlifting and more. I am sure it would also be an interesting topic to study why the PCA projection groups them together.

In general, there are a number of parameters to play with in LDAvis, and I don’t pretend to know all of/about them. For example, you can cycle through the topics using the controls on the top as well. A handy tool for topic exploration.

But I do also prefer just using the textual outputs of the topics as shown below. To see a large number of topics at once vs cycling through one at a time. Maybe some combination would work best.

The 25 and 100 topics from wikipedia for my text output code above:

25 Wikipedia topics (I manually tried to cut these to the 20 top words from the 100 I printed, so it's ~20 words each):

I find these topics for Wikipedia to be pretty good and clear topics. More data obviously gives better topics. I am still running the coherence metrics for these Wikipedia models. Even if u_mass is supposed to be faster, it took me 4 days to run it just for the 25 topics on Wikipedia. So it would take me weeks to run it for all the 25-200 sized topic counts. If I ever finish it, maybe I will post some update.

I am sure there would be lots of interesting things to explore in Wikipedia by increasing topic counts, looking at the relations between topics, how they evolve as the numbers increase, and so on. Unfortunately, I am not paid for this and have too many other things to do..

So if I want to apply topic models, what would I do right now (NLP is getting lots of attention so who knows in a few years..)? Try a number of different topic distributions and parameters if possible, look at the models manually both in text and visually, and pick a nice configuration. Depends really if the topics are used for human consumption as such or just as some form of automated input.

If I needed to model large numbers of separate sets evolving over time, I might just use the coherence metrics along with some heuristics (e.g., number of docs vs. number of topics) to make automated choices, run the whole thing as micro-services at intervals, and use the results automatically. Tune as needed over time.

Fewer and more static sets might benefit from more tailored approaches.

I wanted to try a part of speech tagger (POS) to see if it could help me with some of the natural language processing (NLP) problems I had. This was in Finnish, although other languages would be nice to have supported for the future. So off I went, (naively) hoping that there would be some nicely documented, black-box, open-source, free, packages available. Preferably, I was looking for one in Java as I wanted to try using it as part of some other Java code. But other (programming) languages might work as well if possible to use as a service or something. Summary: There are a bunch of cool libs out there, just need to learn POS tagging and some more NLP terms to train them first…

I remembered all the stuff about Parsey McParseface, SyntaxNet and all those hyped Google things. It even advertises achieving 95% accuracy on Finnish POS tagging. How cool would that be. And it's all about deep learning, Tensorflow, Google engineers and all the other greatest and coolest stuff out there, right? OK, so all I need to do is go to their GitHub site, run some 10 steps of installing various random-sounding packages, mess up my OS configs with various Python versions, settings, and all the other stuff that makes Python so great (OK, let's not get upset, it's a great programming language for stuffs :)). Then I just need to check out the SyntaxNet git repo, run a build script for an hour or so, set up all sorts of weird stuff, and forget about a clean/clear API. OK, I pass, after messing with it for too long.

So. After trying that mess, I Googled, Googled, Duckducked, and some more for some alternatives better suited for me. OpenNLP seemed nice as it is an Apache project, and those have generally worked fine for me. There are a number of different models for it at SourceForge. Some of them are even POS tagger models. Many nice languages there. But no Finnish. Now, there is an option to train your own model. Which seems to require some oddly formatted, pre-tagged text sets to train on. I guess that just means POS tagging is generally seen as a supervised learning problem. Which is fine, it's just that if you are not deep into the NLP/POS tagging community, these syntaxes do look a bit odd. And I just wanted a working POS tagger, not a problem of trying to figure out what all these weird syntaxes are, or a problem of setting up a project on Mechanical Turk or whatever to get some tagged sentences in various languages.

What else? There is a nice-looking POS tagger from the Stanford NLP group. It also comes with out-of-the-box models for a few languages. Again, no Finnish there either, but a few European ones. Promising. After downloading it, I managed to get it to POS tag some English sentences and even do lemmatization for me (finding the dictionary base form of the word, if I interpret that term correctly). Cool, certainly useful for any future parsing and other NLP tasks for English. They also provide some instructions for training it for new languages.

This training again requires the same kind of pre-annotated, POS-tagged training data. Seeing some pattern here.. See, even I can figure it out sometime. So there is actually a post on the internets where someone describes building a Swedish POS tagger using the Stanford tagger. And another one instructing people (in the comments) to download the tagger code and read it to understand how to configure it. OK, not going to do that. I just wanted a POS tagger, not an excursion into some large code base to figure out some random-looking parameters that require a degree in NLP to understand. But hey, Sweden is right next to Finland, maybe I can use the configuration used for Swedish to train my own Finnish POS tagger? What a leap of logic I have there..

I downloaded the Swedish .props file for the Stanford tagger, and now just needed the data. Which, BTW, I needed for all the others as well, so I might as well have gone with OpenNLP and tried that, but who remembers that anymore at this point.. The Swedish tagger post mentioned using some form of Swedish Treebank data. So is there a similar Finnish Treebank? I remembered hearing that term. Sure there is. So I downloaded that. Unpacked the 600MB zip to get a 3.8GB text file for training: the ftb3.1.conllx file. Too large to open in most text editors. More/less to the rescue.

But hey, this is sort of like big data, which this should be all about, right? Maybe the Swedish .props file just works with it, after all, both are Treebanks (whatever that means)? The Swedish Treebank site mentions having a specific version for the Stanford parser built by some Swedish treebank visitor at Googleplex. Not so for Finnish.

Just try it. Of course the Swedish .props file won't work with the Finnish Treebank data. So I built a Python script to parse it and format it more like the Swedish version: words one per line, sentences separated with linefeeds. The tags seem to differ across the various files around, but I have no idea how to map them over, so I just leave them and hope the Stanford people have it covered. (Looking at it later, I believe they all treat it as a supervised learning problem with whatever target tags you give.)

I tried the transformed file with the Stanford POS tagger. My Python script tells me the file has about 4.4 million sentences, with about 76 million words or something like that. I give the tagger JVM 32GB of memory and see if it can handle it. No. Out of memory error. Oh dear. It's all I had. After a few minor modifications to the .props file, I make the training data set smaller, until finally at 1M sentences the tagger finishes training.

Meaning the program runs through and prints nothing (no errors, but nothing else either). There is a model file generated that I can use for tagging. But I have no idea if it is any good, or how badly I just trained it. Most of the training parameters have a one-line description in the Javadoc, which isn't hugely helpful (for me). Somehow I am not too confident I did it very well. Later, as I did various splits on the FinnTreeBank data for my customized Java tagger and the OpenNLP tagger, I also tried this model on the 1.4M sentence test set. Got about 82% accuracy, which seems pretty poor considering everything else I talk about in the following. So I am guessing my configuration must have been really off, since people have otherwise reported very good results with it. Oh well, maybe someone can throw me a better config file?

This is what running the Stanford tagger on the 1M sentence set looked like on my resource graphs:

So it mostly runs on a single core and uses about 20GB of RAM for the 1M sentence file. But obviously I did not get it to give me good results, so what other options do I have?

During my Googling and stuff I also ran into a post describing writing a custom POS tagger in 200 lines of Python. Sounds great, even I should be able to get 200 lines of Python, right? I translated that to Java to try it out on my data. Maybe I will call my port “LittlePOS”. Make of that what you will :). At least now I can finally figure out what the input to it should be and how to provide it, since I wrote (or translated) the code, eh?

Just to quickly recap what (I think) this does (a rough code sketch follows after this list).

Normalize all words: lowercase them, change year numbers to “!YEAR” and other numbers to “!DIGIT”.

Collect statistics for each word: how often different POS tags appear for it. A threshold of 97% is used to mark a word as “unambiguous”, meaning it will always be given a specific tag if it has that tag 97% or more of the time in the training data. The word also needs to occur some minimum number of times (here it was 20).

Build a set of features for each POS tag. These are used for the “machine learning” part to learn to identify the POS tag for a word. In this case the features used were:

Suffix of word being tagged. So its last 3 letters in this case.

Prefix of word being tagged. Its first letter in this case.

Previous tag. The tag assigned to the previous word in the sentence.

2nd previous tag. The tag assigned to the word before the previous word :).

Combination of the previous and previous-previous tags. So previous tag-pair.

The word being tagged itself.

Previous tag and current-word pair.

Previous word in sentence.

Suffix of previous word, its 3 last letters.

Previous-previous word. So two spots back in the sentence from the word we are tagging.

Next word in sentence.

Suffix of next word. Its 3 last letters.

Next-next word in sentence. So the word after the next word. To account for the start and end of a sentence, the sentence word array is always padded with the “synthetic words” START1, START2 and END1, END2, so these features work even if there is no real previous or next word. Also, a word can be anything here, including punctuation marks.

Each of the features is given a weight. These weights are used to predict which POS tag a word should get, based on its features in the sentence.

If, in training, a word is given (predicted) a wrong tag based on its features, the weights of those features for the wrong tag are reduced by 1 each, and the weights of those features for the correct tag are increased by 1 each.

If the tag was correctly predicted, the weights stay the same.
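To make that recap concrete, here is a rough Python sketch of the normalization, feature extraction, prediction, and weight update described in the list above. This is my simplified reading of the 200-line tagger, not the original code; all the names and details here are mine.

```python
from collections import defaultdict

def normalize(word):
    # Lowercase words, map year-like numbers to !YEAR and other numbers to !DIGIT.
    if word.isdigit() and len(word) == 4:
        return "!YEAR"
    if word and word[0].isdigit():
        return "!DIGIT"
    return word.lower()

def features(i, word, context, prev, prev2):
    # 'context' is the sentence padded with the synthetic START1/START2 and
    # END1/END2 words, 'i' is the index of the word being tagged in it, and
    # 'prev'/'prev2' are the tags of the two previous words.
    f = defaultdict(int)
    def add(name, *parts):
        f[" ".join((name,) + parts)] += 1
    add("word suffix", word[-3:])          # last 3 letters of the word
    add("word prefix", word[0])            # first letter of the word
    add("prev tag", prev)
    add("prev2 tag", prev2)
    add("prev tag pair", prev2, prev)      # the previous tag-pair
    add("word", word)                      # the word itself
    add("prev tag + word", prev, word)
    add("prev word", context[i - 1])
    add("prev word suffix", context[i - 1][-3:])
    add("prev2 word", context[i - 2])
    add("next word", context[i + 1])
    add("next word suffix", context[i + 1][-3:])
    add("next2 word", context[i + 2])
    return f

def predict(weights, feats):
    # Score each tag by summing the weights of the active features, pick the best.
    scores = defaultdict(float)
    for feat, count in feats.items():
        for tag, weight in weights.get(feat, {}).items():
            scores[tag] += count * weight
    return max(scores, key=scores.get) if scores else None

def update(weights, truth, guess, feats):
    # weights maps feature -> {tag: weight}. If the predicted tag was wrong,
    # add 1 to each feature's weight for the correct tag and subtract 1 from
    # its weight for the wrongly predicted tag; if correct, change nothing.
    if truth == guess:
        return
    for feat in feats:
        tags = weights.setdefault(feat, {})
        tags[truth] = tags.get(truth, 0) + 1
        tags[guess] = tags.get(guess, 0) - 1
```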

Getting this basic idea also helps me understand the other taggers and their parameters a bit better. I think this is roughly what the “arch” parameter in the Stanford tagger props file defines, and maybe that is what would need a better fix in my configuration? I believe this feature setup must be one of the parts of POS tagging with the most diverse sets of possibilities as well.. While on the topic of the Stanford tagger: it also seemed a bit slow at a 50ms average tagging time per sentence, compared to the other taggers I discuss in the following. Not sure what I did wrong there. But back to my Python-to-Java porting.

I updated my Python parser for the FinnTreeBank to produce just a file with the word and POS tag extracted, and fed that to LittlePOS. This still ran out of memory on the 4.4M sentences with a 32GB JVM heap. But not in the training phase, only when I finally tried to save the model as a Protocol Buffers binary file. The model in memory seems to get pretty big, so I guess the protobuf generator also ran out of resources when trying to build a 600MB file with all the memory already allocated for the tagger training data.

In the resources graph this is what it looks like for the full 4.4M sentences:

The part on the right, where the “system load” is higher and the “CPU” part bounces wildly, is where the protobuf is being generated. The part on the left before that is where the actual POS tagger training takes place. So the protobuf generation was actually running for quite a long time; my guess is that JVM memory was low and way too much garbage collection etc. was happening. Maybe it would have finished after a few more hours, but I called it a no-go and stopped it.

3M sentences finishes training fine. I use the remaining 1.4M for testing the accuracy, meaning I use the trained tagger to predict tags for those 1.4M sentences and count how many words it tagged right across all of them. This gives me about 96.1% accuracy with the trained tagger. Awesome, now I have a working tagger??
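For reference, the accuracy number is simply the share of words in the held-out sentences that got the right tag; something like the sketch below, where tag_sentence stands in for whichever tagger is being tested (the function names are mine, not from any of the taggers).

```python
def evaluate(tag_sentence, test_sentences):
    # test_sentences is a list of (words, gold_tags) pairs from the held-out set.
    correct = total = 0
    for words, gold_tags in test_sentences:
        predicted = tag_sentence(words)   # one predicted tag per word
        for pred, gold in zip(predicted, gold_tags):
            total += 1
            if pred == gold:
                correct += 1
    return correct / total
```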

The resulting model for the 3M sentence training set, when saved as a protobuf binary, is about 600MB. Seems rather large. Probably why it was failing to write it with the full 4.4M sentences. A smaller model might be useful to make it more usable in a smaller cloud VM or something (I am poor, and cloud is expensive for bigger resources..). So I tried to train it on sets of 100k to 1M sentences in 100k increments, and also on 2M and 3M sentences. Results for LittlePOS are shown in the table below:

Sentences | Words correct | Accuracy | PB Size | Time/1
--------- | ------------- | -------- | ------- | ------
100k  | 21988662 | 88.7% | 90MB  | 4.5ms
200k  | 22490881 | 90.7% | 153MB | 4.1ms
300k  | 22608641 | 91.2% | 195MB | 3.9ms
400k  | 22779163 | 91.9% | 233MB | 3.8ms
500k  | 22911452 | 92.4% | 268MB | 3.7ms
600k  | 23033403 | 92.9% | 304MB | 3.5ms
700k  | 23095784 | 93.1% | 337MB | 3.7ms
800k  | 23149286 | 93.4% | 366MB | 3.5ms
900k  | 23169125 | 93.4% | 390MB | 3.2ms
1M    | 23167721 | 93.4% | 378MB | 3.3ms
2M    | 23520297 | 94.8% | 651MB | 3.0ms
3M    | 23843609 | 96.2% | 890MB | 2.0ms
1M_2  | 23105112 | 93.2% | 467MB | ms
3M_0a | 20859104 | 84.1% | 651MB | 1.7ms
3M_0b | 22493702 | 90.7% | 651MB | 1.7ms

Here:

Sentences is the number of sentences in the training set.

Words correct is the number of words correctly predicted. The total number of words is always 24798043, as all tests were run against the last 1.4M sentences (the ones left over after taking the 3M training set).

Accuracy is the percentage of all predictions the tagger got right.

PB Size is the size of the model as a Protocol Buffers binary after saving it to disk.

Time/1 is the average time the tagger took to tag one sentence.

The line with 1M_2 shows an updated case, where I changed the training algorithm to run for 50 iterations instead of the 10 it had been set to in the Python script. Why 50? Because Stanford and OpenNLP seem to use a default of 100 iterations, and I wanted to see what difference increasing the iteration count makes. Why not 100? Because I started by training the 3M model for 100 iterations and, looking at the progress, calculated it would take a few days to run. The others were much faster, so plenty of room for optimization there. So I just ran 1M sentences with 50 iterations, as that gives an indication of the improvement just as well.

So, the improvement seems to be pretty much zero. In fact, the accuracy seems to have gone slightly down. Oh well. I am sure I did something wrong again. It is also possible to track the number of correctly predicted tags over the added training iterations. The figure below illustrates this:

This figure shows how much of the training set the tagger got right during the training iterations. Maybe the improvement in the later iterations looks small because of the scale, but it is still improving. Unfortunately, in this case, that did not seem to have a positive impact on the test set. There are also a few other points of interest in the table.

Back to the results table. The line with 3M_0a shows a case where all the features were ignored, that is, only the “unambiguous” words were tagged. This alone gives a result of 84.1%. The most frequent tag among the remaining untagged words is “noun”, so tagging all of the remaining 15.9% as nouns gives the score in 3M_0b. In other words, if you take all the words that clearly seem to have only one tag, give them that tag, and tag all the remaining words as nouns, you get about 90.7% accuracy. I guess that would be the baseline to compare against.. This score is without any fancy machine learning stuffs. Looking at this, the low score I got for training the Stanford POS tagger was really bad, and I really need that “for dummies” guide to properly configure it.
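That baseline is simple enough to sketch out. The threshold values are the ones from the recap earlier; the "NOUN" fallback is a placeholder, since the actual FinnTreeBank tag name may be different.

```python
from collections import Counter, defaultdict

def build_unambiguous(train_sentences, min_count=20, threshold=0.97):
    # Count how often each tag appears for each word in the training data and
    # keep only words that occur often enough and nearly always get the same tag.
    counts = defaultdict(Counter)
    for words, tags in train_sentences:
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    unambiguous = {}
    for word, tag_counts in counts.items():
        tag, freq = tag_counts.most_common(1)[0]
        total = sum(tag_counts.values())
        if total >= min_count and freq / total >= threshold:
            unambiguous[word] = tag
    return unambiguous

def baseline_tag(words, unambiguous, fallback="NOUN"):
    # 3M_0a: tag only the unambiguous words; 3M_0b: tag everything else as a noun.
    return [unambiguous.get(word, fallback) for word in words]
```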

But wait, now that I have some tagged input data and Python scripts to transform it into different formats, I could maybe just modify these scripts to give me OpenNLP-compliant input data? Brilliant, let's try that. At least OpenNLP has default parameters and seems more suited for dummies like me. So, on to transforming my FinnTreeBank data to the OpenNLP input format with another Python script and running my experiments. A sketch of the conversion is below, and the results follow in the table after that.
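The OpenNLP POS tagger trainer wants one sentence per line, with each token written as word_TAG and tokens separated by spaces (at least that is how I read its documentation). So converting from the word-per-line file above is roughly this (file names again being placeholders, not my exact script):

```python
def wordtag_to_opennlp(in_path="ftb3.1.wordtag.txt", out_path="ftb3.1.opennlp.txt"):
    # Turn "word<TAB>tag" lines (blank line = sentence boundary) into
    # one sentence per line of space-separated word_TAG tokens.
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        tokens = []
        for line in fin:
            line = line.strip()
            if not line:
                if tokens:
                    fout.write(" ".join(tokens) + "\n")
                    tokens = []
                continue
            word, tag = line.split("\t")
            tokens.append(word + "_" + tag)
        if tokens:                            # flush a possible last sentence
            fout.write(" ".join(tokens) + "\n")
```

From there, training was just a matter of pointing the OpenNLP command-line trainer (POSTaggerTrainer, if I remember the tool name right) at the resulting file.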

Sentences | Words correct | Accuracy | PB Size | Time/1
--------- | ------------- | -------- | ------- | ------
100k   | 22247182 | 89.7% | 4.5MB  | 7.5ms
200k   | 22680369 | 91.5% | 7.8MB  | 7.6ms
300k   | 22861728 | 92.2% | 10.4MB | 7.7ms
400k   | 22994242 | 92.7% | 12.8MB | 7.8ms
500k   | 23114140 | 93.2% | 14.8MB | 7.8ms
600k   | 23199457 | 93.6% | 17.1MB | 7.9ms
700k   | 23235264 | 93.7% | 19.2MB | 7.9ms
800k   | 23298257 | 94.0% | 21.1MB | 7.9ms
900k   | 23324804 | 94.1% | 22.8MB | 7.9ms
1M     | 23398837 | 94.4% | 24.5MB | 8.0ms
2M     | 23764711 | 95.8% | 39.9MB | 8.0ms
3M     | 24337552 | 98.1% | 55.9MB | 8.1ms
(4M)   | 24528432 | 98.9% | 69MB   | 9.6ms
4M_2   | 6959169  | 98.5% | 69MB   | 9.7ms
(4.4M) | 24567908 | 99.1% | 73.5MB | 9.6ms

There are some special cases here:

(4M): This mixed training and test data by training with the first 4M of the 4.4M sentences and then testing with the last 1.4M of the same 4.4M. I believe in machine learning you are not supposed to test with the training data, or the results will look too good and not indicate any real-world performance. Had to do it anyway, didn't I 🙂

(4.4M): This one used the full 4.4M sentences for training and then tested on a 1.4M subset of the same set. So it's a broken test again, mixing training and test data.

4M_2: For the evaluation, this one used the sentences remaining after taking out the 4M training sentences. Since the total of 4.4M is actually more like 4.36M, the test set here was only about 360k sentences, as opposed to the 1.4M (or 1.36M to be more accurate) used elsewhere. But it is not mixing training and test data any more, which is probably why the score is slightly lower. Still an improvement, so I might as well train on the whole set at the end. The number of test tags here is 7066894, as opposed to the 24798043 in the 1.4M sentence test set.

And the resource use for training at 4M file size:

So my 32GB of RAM is plenty, and as usual it is a single core implementation..

Next I should maybe look at putting this up as some service to call over the network. Some of these taggers actually already have support for it but anyway..

A few more points I collected on the way:

For the bigger datasets it is obviously easy to run out of memory. Looking at the code for the custom tagger trainer and the full 4.4M sentence training data, I figure I could scale this pretty high in terms of sentences processed by storing the sentences in a document database instead of keeping them all in memory at once. ElasticSearch would probably do just fine, as I have been using it for other stuff as well. Then read the sentences from the database into memory as needed. The main reason the algorithm needs to keep the sentences in memory is to shuffle them randomly for each new training iteration. I could just shuffle the index numbers of the sentences stored in the DB and read smaller batches into memory for training, as in the sketch below. But I guess I am fine with my tagger for now. Similarly, the algorithm uses just a single core in training, but could quite easily be parallelized to process each sentence separately, making it “trivially parallel”. Would need to test the impact on accuracy though. Memory use could probably go lower with various optimizations, such as hashing the keys. There are probably plenty of possible optimizations for both CPU and memory, but maybe I will just use OpenNLP and let someone else worry about it :).
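A minimal sketch of that shuffled-batch idea, with the database access hidden behind a callback. I have not actually implemented this, so treat the ElasticSearch part as an assumption; here it is just a function that fetches one sentence by its id.

```python
import random

def shuffled_batches(num_sentences, batch_size, fetch_sentence):
    # fetch_sentence(i) would read sentence number i from the document database
    # (e.g. an ElasticSearch get by id). Only one batch is in memory at a time.
    ids = list(range(num_sentences))
    random.shuffle(ids)                       # new random order for each training pass
    for start in range(0, len(ids), batch_size):
        yield [fetch_sentence(i) for i in ids[start:start + batch_size]]

# Each training iteration would then loop over shuffled_batches(4_400_000, 10_000, fetch)
# and feed the batches to the trainer one at a time.
```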

From the results of the different runs, there seems to be some consistency in LittlePOS tagging faster when trained on bigger datasets, and OpenNLP slightly slower. The Stanford tagger seems to be quite a bit slower at 50ms, but that could again be due to configuration or some other issue on my part. OpenNLP gets better accuracy than my LittlePOS, and its model files are much smaller, so the tradeoff in this case would be model size (and accuracy) vs tagging speed. Tagging getting faster with bigger training sets seems a bit odd, but maybe more of the words become “unambiguous” and can thus be handled with a simple map lookup?

Finally, in the hope of trying this out on a completely different dataset, I tried to download the Finnish datasets from Universal Dependencies and test against those. I got the idea from the Syntaxnet stats, which used these as test and training sets. I figured it might show how well the results hold up across sets taken from different sources. Unfortunately Universal Dependencies uses a different tag set from the FinnTreeBank I used for training, and I ran out of motivation trying to map them together. Oh well, I just needed a POS tagger, and I believe I now know enough about the topic and have a good enough starting point to look at the next steps..

But enough about that. Next, I think I will look at some more items in my NLP pipeline. Get back to that later…