Unstructured data: an architect’s guide to text

I recently completed an excellent book which examines how to deal with information presented as text. It’s called Taming Text from Manning. The authors do a good job of introducing each topic and explain how a number of open source tools can be applied to the problems each topic presents. I’ve not studied the latter but the former are a great introduction.

I have summarised each topic, below:

It’s hard to get an algorithm to understand text in the way humans can. Language is complex and an area of much academic study. Text is everywhere and contains plenty of potentially useful information.

The first step in dealing with text is to break it down into parts and the most simplistic aim of this step is to extract individual words, however there are a number of approaches and more sophisticated ones will need to handle punctuation. The process of splitting text down is called tokenisation. Individual works may often then be put through a stemming algorithm in order to be able to equate pluralised and different tenses of the same stem. A stem might be a recognisable word but not necessarily.

In order to search content it must first be indexed which will require tokenisation and stemming and maybe also stop-word removal and synonym expansion. It is also useful, for subsequent ranking, to use an index that allows the distance between words found from the search phrase to be calculated for each document searched. There are a number of algorithmic approaches for ranking results the simplest of which are based on the vector space model. Obviously ranking is an evolving area and the big internet search engines are constantly evolving it. Another refinement that can be applied to search is the key constituent of spell-checking: fuzzy matching.

Fuzzy matching is another area of academic research with some established algorithms based on character overlap, edit distance and n-gram edit distance which may all be combined with prefix matching using a trie (prefix tree). The most important aspect of fuzzy matching to understand is that different algorithms will be more or less effective depending on the sort of information being matched, for example Movie Titles are best matched on Jaro-Winkler distance but Movie Actors are bet matched with a more exact algorithm given that they are used like brand names.

It can be useful to be able to extract people, places and things (including monetary amounts, dates, etc.). Again there are a number of algorithms for achieving this including open source implementations from the OpenNLP project. Machine learning can play a part provided there is plenty of tagged (training) examples available, which is especially useful where domain-specific text needs to be ‘understood’.

Given a large set of documents often there is a requirement to group similar documents. This is a process called clustering and can be observed in operation on news amalgamation sites. Note that clustering does not assign meaning to each cluster. There are a number of established algorithms, many of which are shared with other clustering problems. Given the large volumes and algorithm complexity a real-world clustering task is quite likely to want to make use of parallel processing and this is what the Carrot and Apache Mahout projects provide by building on top of Apache Hadoop.

Another activity for sets of documents is classification which is similar to clustering but starts with sets of documents that have been assigned to a pre-determined category by a human or other mechanism, for example asking users to tag articles. Example classification tasks are sentiment analysis or rating reviews as positive or negative. Of course there are a number of algorithms and implementations to choose from with each having trade-offs in accuracy and performance.