Simple text classification with a little F# (Part 1)

Disclaimer: There are many great off-the-shelf packages available for machine learning and text classification, and you’d be better served using those rather than rolling your own. This post is mostly intended to be an easy-to-understand tutorial on text classification.

How I learned to teach a computer something

At a previous job, I was faced with a problem of document classification. Workers were poring over huge files, page by page, manually building a table of contents for each one. It took a long time to train workers on how to classify these documents, and even the best workers were inconsistent. There was no good way to scale the operation.

We needed a system that could ingest large files, each with thousands of pages of text, and automatically classify each individual document they contained. I knew there had to be a way to teach a computer how to recognize this stuff. This was a new frontier for me, so I started reading a lot of wikipedia articles and thinking really hard.

Training data

There were a few hundred types of documents we needed to identify. We had hundreds of pre-labeled samples of each document type. I knew I needed to leverage those samples to build something that could read a document it’s never seen and tell me which category it fell under. Those samples would be my training data.

It’s often the case, at least with supervised machine learning, that your training set plays a much more important role in your classifier’s accuracy than which algorithm or model you use.

We’ll revisit this topic once we get a working prototype and want to improve its accuracy.

Boiling it down

I needed to grind up those samples so I could derive some useful information from their contents. Firstly, I needed to discard any noise: punctuation, symbols, etc. Let’s define some functions for sanitizing our samples:
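A minimal sketch of such sanitizers might look like the following (the function names here are my own, not necessarily those from the original code):

```fsharp
open System

// Keep only letters, digits, and whitespace; everything else is noise.
let stripPunctuation (text: string) =
    text
    |> String.filter (fun c -> Char.IsLetterOrDigit c || Char.IsWhiteSpace c)

// Normalize casing so "Cat" and "cat" count as the same term.
let normalize (text: string) =
    text.ToLowerInvariant()

// Compose the two into a single sanitizing step.
let sanitize = stripPunctuation >> normalize
```

`sanitize "Hello, World!"` yields `"hello world"`, with punctuation discarded and casing normalized.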

Tokenization

Tokenization is the process of breaking down the text into its constituent words (or tokens). We’ll simply break the text up into individual words by splitting on whitespace. Then we can optionally combine those individual words into n-grams.
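A whitespace tokenizer can be sketched in a few lines (a simple version for illustration, assuming sanitized input):

```fsharp
open System

// Split on any whitespace, dropping the empty entries produced by
// runs of consecutive spaces or newlines.
let tokenize (text: string) =
    text.Split([| ' '; '\t'; '\r'; '\n' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.toList
```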

We’re using a simple bag of words model. Imagine you’ve ripped a page from a book and very meticulously cut out each word with a tiny pair of scissors and put them in a bag. You’d have a bag of unigrams.

An individual word is a unigram, a pair of words is a bigram, and so on. Extracting n-grams larger than a unigram can be done using a “sliding window”. For example, the string these pretzels are making me thirsty would produce the following trigrams: these pretzels are, pretzels are making, are making me, making me thirsty.
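The sliding window described above maps directly onto F#’s `List.windowed`; a small sketch:

```fsharp
// Slide a window of size n across the token list, joining each
// window back into a single n-gram string.
let ngrams n (tokens: string list) =
    tokens
    |> List.windowed n
    |> List.map (String.concat " ")

// ngrams 3 ["these"; "pretzels"; "are"; "making"; "me"; "thirsty"]
// → ["these pretzels are"; "pretzels are making";
//    "are making me"; "making me thirsty"]
```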

Term frequency

Regardless of which n-gram size we choose, we still end up with a bag of them. What information can we derive from the bag’s contents? We’ve lost any information about the original order of the words, but we can count how many times we find each n-gram in the bag.
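Counting the occurrences of each n-gram in the bag is a one-liner with `List.countBy` (a sketch; the original code may differ):

```fsharp
// Pair each distinct n-gram with the number of times it appears.
let countTerms (ngrams: string list) =
    ngrams
    |> List.countBy id

// countTerms ["cat"; "dog"; "cat"] → [("cat", 2); ("dog", 1)]
```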

If your document domain is fairly homogenous, larger n-grams might help differentiate between similar documents. However, using too large an n-gram may cause your classifier to only recognize documents that are very similar to your training set.

For efficiency’s sake, we’ll map each n-gram to an integer, e.g. cat will always map to 1, dog will always map to 2, etc. We could technically do without this, but working with the original strings throughout our classifier would waste memory and CPU cycles.

type Term = int
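One simple way to assign those stable integer ids is to intern each n-gram in a mutable dictionary on first sight (a sketch of my own, not necessarily the original implementation):

```fsharp
open System.Collections.Generic

// Maps each distinct n-gram to a stable integer id.
let private termIds = Dictionary<string, int>()

// Look up an n-gram's id, assigning the next free id if it's new.
let toTerm (ngram: string) : int =
    match termIds.TryGetValue ngram with
    | true, id -> id
    | false, _ ->
        let id = termIds.Count + 1
        termIds.[ngram] <- id
        id
```

With this, `toTerm "cat"` always returns the same integer no matter how many times it is called.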

We’ll define a term frequency tuple that pairs a Term with the number of times it appears in a document:

type TermFrequency = Term * int

We’ll define a record type for representing each document we’re using as a training sample. It holds the path to the physical file and its set of term frequencies: