Friday, June 27, 2008

Up until now, we've measured the error rates of our various learners without worrying too much about what good an error-prone machine learner actually is. By dividing the learner's responses into the four categories of hit, miss, false positive and correct negative, we can get a more nuanced picture of what it is doing when it makes a mistake. Here we look at false positives, trials that the learner mistakenly identifies as belonging to the category of interest. We start by writing a program that goes through each of the TFIDF-50 learner's responses for the various offence categories in the 1830s. It collects all of the false positives, making a note of what offence category each trial actually belongs to. The code to do this is here. We can then plot the information in a convenient form. I've decided to use pie charts.
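The collection step can be sketched in a few lines of Python. The response tuples and category names here are hypothetical stand-ins for the data structures in the linked code:

```python
from collections import Counter

def tally_false_positives(responses):
    """Count the actual categories of trials the learner wrongly said
    'yes' to. Each response is a hypothetical (predicted_yes, actual_yes,
    actual_category) tuple recorded during testing."""
    tally = Counter()
    for predicted_yes, actual_yes, actual_category in responses:
        if predicted_yes and not actual_yes:   # a false positive
            tally[actual_category] += 1
    return tally

results = [
    (True,  True,  'breakingpeace-assault'),    # hit
    (True,  False, 'sexual-indecentassault'),   # false positive
    (True,  False, 'sexual-assaultwithintent'), # false positive
    (False, False, 'theft-simplelarceny'),      # correct negative
]
print(tally_false_positives(results))
```

The resulting counts are what get plotted as pie slices.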

The figure below shows the results for the offence category of assault, coded as a way of breaking the peace. What happens when our learner thinks that a trial is an example of this category but it really isn't? About 38.6% of the time, the trial in question was actually categorized as indecent assault (sexual), and about 38.6% of the time it was assault with intent (also sexual). Almost 11% of the time, the trial was a case of assault with sodomitical intent, and another 8% of the trials were actually categorized as an instance of wounding. In other words, about 96% of the learner's false positive "errors" in this case were other kinds of assault. What of the trials classified as "miscellaneous - other"? One was this trial, where 44-year-old William Blackburn was found guilty of "unlawfully and maliciously administering to Hannah Mary Turner 6 drachms of tincture of cantharides, with intent to excite, &c." I understand that this case probably doesn't fit the definition of assault used either by Blackburn's contemporaries or by the person who coded the file. Nevertheless, it is not completely unrelated to the idea of an assault, and is exactly the kind of source that a historian could use to shed light on gender relations, sexuality, or other topics.

The next figure shows the false positives for fraud, categorized as a kind of deception. Seventy-two percent of the learner's false positives in this case were actually categorized as coining offences, and another 12% were actually cases of forgery. Once again, the vast majority of cases that were incorrectly identified as fraud belonged to relatively closely related offence categories. Note that these results cannot be explained by appealing to the distribution of offences in the sample as a whole. If the false positives were selected by the learner at random, we would expect most of them to be cases of larceny, which are by far the most commonly attested. Instead we see that a learner trained to recognize one kind of assault is confused by other kinds of assault, and one trained on fraud by other kinds of fraud.

A learner trained on manslaughter is mostly confused by cases of wounding and murder, as shown in the next figure.

Finally we can consider a kind of theft, in this case housebreaking. If any learner were going to be confused by larceny cases, it should be one trained to recognize a type of theft. Instead, this learner is more confused by the less-frequently attested but more closely related categories of burglary and theft from place.

Now we are in a position to provide one kind of answer to the question, "what good is an error-prone learner?" Since the learner's errors are meaningfully related to its successes, we can use false positives as a way of generalizing beyond the bounds of hard and fast categorization. If we used a search engine to find cases of assault we might miss some of the most interesting such cases (like the cantharides example) ... cases that are interesting precisely because they lie just outside the category. One of the things that machine learning gives us is a way of finding some of the more interesting exceptions to our rules.

Wednesday, June 25, 2008

We feel pretty confident that the performance of the TFIDF-50 version of the naive bayesian learner is going to be relatively stable regardless of the frequency with which a particular offence is attested. At this point we can write a routine which tests the learner on each of the offences which occurred 10 or more times in the 1830s. Our testing routine takes advantage of the fact that, unlike many other kinds of machine learner, the naive bayesian can be operated in online mode. What this means is that we can train the learner on some data, test its performance, then train it on some more data. Many learners can only be operated in offline or batch mode. This means they have to be trained on all of the data before they can be tested, and there is no way at that point to subject them to further training. The fact that the naive bayesian can be used for online learning will turn out to be crucial for us.

The code for testing is here. The learner is given the trials in chronological order, one at a time. The way that the program works is that it first uses the current state of the learner to classify a trial. The classification is scored as a hit, miss, false positive or correct negative, then the trial is used to train the learner (with the appropriate category being given as feedback). The learner is then given the next trial to judge. Once the learner has seen all of the data, the final count of hits, misses, etc. is output and the performance plotted as in previous posts. The results are shown below for the 1830s.

As can be seen, the performance is pretty stable, considering that the various offences range from 0.077% of the total (perverting justice, 10/12959) to 42.48% (simple larceny, 5505/12959). The system gets very few false positives for bigamy, and quite a few for shoplifting. We'll look at why this is the case in the next post. It is very accurate for the most frequently attested offence, simple larceny, and relatively inaccurate for the infrequently attested offences of kidnapping (11/12959) and perverting justice (10/12959). The central part of the plot is magnified and shown in the figure below. The performance of the learner varies for similar sorts of crime (e.g., it performs better for indecent assault than assault), something that we will take up next.

Sunday, June 22, 2008

In our last post, we settled on a style of plotting that shows both how accurate our learner is (i.e., does it miss very often?) and how precise (i.e., how often does it return a false positive?) We also decided to do experiments with the version of the naive bayesian learner that uses the items with the highest tf-idf as features. Our experiments to date have used the category of simple larceny in the 1830s. This offence is very well-attested, making up about 42.5% of the trials (5505/12959). At this point, we can test the performance of the same learner on offence categories that are less frequent: stealing from master (1718/12959, approx. 13.3%) and burglary (279/12959, approx. 2.2%). We've been using the 15 terms with the highest tf-idf, but we should try some other values for that parameter, too. A graph for the three different offence categories is shown below. The four learners use the top scoring 15, 30, 50 and 100 items, respectively.

From the graph, it is pretty clear that it is easiest to learn to categorize larceny, which is the best-attested offence we looked at. We can also see that the TFIDF-15 learner does particularly poorly by missing many instances of the less frequent offences. Increasing the number of features the learner can make use of seems to improve performance up to a point. After that, increasing features increases the number of false positives the learner makes. We want the performance of our learner to be relatively robust when learning offence categories that are more or less frequently attested, which means we want the learner with the tightest grouping of results for these test categories (in other words, TFIDF-50).

Note that in this test, we only ran each learner once on each data set, rather than doing ten-fold cross-validation. Our experiments with cross-validation suggested that the different versions of the learner were relatively insensitive to the order in which training and testing trials were presented. Since this is exploratory work, we will make the (possibly incorrect) assumption that a single run is probably representative. This will let us do a lot more testing in the same amount of time.

Saturday, June 21, 2008

There are many different ways to measure the performance of our various learning algorithms. The error rate that we've been using so far we defined as the sum of misses and false positives divided by the total number of trials. By this measure, COINFLIP had an average error rate around 50%, and our naive bayesian learner had an error rate around 40% using one word features, and around 26% using either 2-grams or top-scoring tf-idf features. I thought I might be able to get better performance by using only those 2-grams that included terms with a high tf-idf, but that learner had an error rate around 26%, too. (Recall that we've been using cases of simple larceny in the 1830s for our experiments... the performance will be different for other offences and/or other decades. We'll test some of these soon.)

By using a different measure, we can see that our various learners achieve their results in different ways. From our perspective as researchers, the least interesting category of answers is the correct negatives. Misses are a problem, because they may contain evidence that relates to the argument that we're trying to construct. False positives are a problem, because they are irrelevant but we have to look through them to determine that... in other words, they're a waste of time. A perfect learner would return all and only hits. If we consider the ratio of misses to hits we can get an idea of how accurate our learner is. As a learner gets better, the ratio of misses to hits approaches 0. As it gets worse, the ratio increases. A disastrous learner might not get any hits, so to avoid a division by zero error, we'll add one to the denominator. Our accuracy measure is thus misses / (hits + 1). If we consider the ratio of false positives to hits we can find out how precise our learner is. As it gets better, this ratio will go to zero, and as it gets worse, the ratio will increase. Our precision measure is false positives / (hits + 1). We can plot both measures on the same graph, with the origin in the lower left hand corner, as shown below. Since some of the values are large, I've used logarithmic axes. (Also, the results for YES and NO actually lie on the respective zero lines, but I've bumped them over so they can be seen in this plot.)
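The two measures are easy to state in code. The counts below are made-up numbers for illustration, not results from the experiments:

```python
def accuracy_measure(hits, misses):
    """misses / (hits + 1): approaches zero as the learner improves."""
    return misses / (hits + 1)

def precision_measure(hits, false_positives):
    """false positives / (hits + 1): also approaches zero as it improves."""
    return false_positives / (hits + 1)

# Illustrative counts only:
print(accuracy_measure(hits=99, misses=50))            # 0.5
print(precision_measure(hits=99, false_positives=25))  # 0.25
```

Note that the +1 in the denominator means a learner with zero hits and zero misses still gets a finite (and perfect-looking) score, which is another reason to plot both measures together.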

Looking at the graph we notice some interesting results. The naive bayesian that uses words for features gets relatively few false positives, but at the cost of missing an order of magnitude more items than the other two learners. The 2-gram learner outperforms COINFLIP and the tf-idf learner on false positives, but not on misses. The tf-idf learner is the only one that outperforms COINFLIP in terms of both accuracy and precision. Thus we will do our next round of experiments with the tf-idf learner.

Wednesday, June 18, 2008

In the last post, we got a naive bayesian learner working and used it to categorize some Old Bailey trials from the 1830s as examples of larceny (or not). Our initial version of the learner was easy to implement, but it made the unrealistic assumption that the probabilities of particular words appearing in the text of a trial were independent. That greatly simplified computation at the cost of performance. Our initial learner had an error rate around 40%. We then revised it to use 2-grams as features rather than individual words. This captured some of the dependency between words, improving our average error rate so it was close to 25%.

An alternative approach is to try to concentrate on the words in a trial which are most representative of a particular category. Without specifying these words in advance, we can make the assumption that they will be relatively frequent in the document in question, but relatively infrequent in the overall corpus of documents. One common measure for this is known as tf-idf. Rather than handing all of the words in a given trial to our learner, or all except the stop words, we will only hand off the 15 or 20 with the highest tf-idf. There are many different ways to compute this measure. The version that I used is tfidf = log(tf+1.0) * log(numdocs/df), where tf is the number of times the word occurs in a particular text, numdocs is the total number of documents, df is the number of documents that the word appears in, and the logarithms are natural logs. The word "cellar," for example, appears in this trial seventeen times, and in 221 of the trials from the 1830s. The tfidf for this word in this trial is log(17+1) * log(12959/221) = 11.76781.
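The formula can be checked directly, assuming natural logarithms (which reproduce the 11.76781 figure for "cellar"):

```python
import math

def tfidf(tf, numdocs, df):
    """tf-idf as defined in this post: log(tf + 1) * log(numdocs / df),
    using natural logarithms."""
    return math.log(tf + 1.0) * math.log(numdocs / df)

# 'cellar': 17 occurrences in one trial, document frequency 221,
# out of 12959 trials in the 1830s
print(round(tfidf(17, 12959, 221), 5))   # 11.76781
```

A word that appears in every document gets log(numdocs/df) = log(1) = 0, so ubiquitous words drop out automatically, which is the point of the idf factor.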

To compute the tf-idf, we first need to create a list of every word that was used in all of the trials, and the number of different trials in which each word appears. We could put this information in a text file, but the file would be huge and very slow to access. Instead, we will store our document frequencies in a SQLite database, using Python commands to store and retrieve the information. The code which creates this database is here. We can then compute the tf-idf scores for each word in a given trial, creating a new directory to store these files. The code to do that is here. Finally, we will want a version of our tenfold cross-validation routine to test the performance of a naive bayesian learner that operates across tf-idf vectors rather than raw words or 2-grams (here). This new learner has similar performance to the 2-gram version, with an average error rate of 25.73% when using the 15 highest scoring tf-idf terms to categorize cases of larceny in the 1830s. As a bonus, it is remarkably fast. At this point, you're probably wondering what good a machine learner is, if one quarter of its judgments are incorrect. We'll get there.
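Here is a sketch of what the document-frequency store might look like. The one-table schema is an assumption for illustration; the linked code may organize things differently:

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # use a filename for a real corpus
conn.execute('CREATE TABLE df (word TEXT PRIMARY KEY, count INTEGER)')

def add_document(words):
    """Record one trial: bump the document frequency of each
    distinct word it contains."""
    for word in set(words):          # set(): count each word once per trial
        cur = conn.execute(
            'UPDATE df SET count = count + 1 WHERE word = ?', (word,))
        if cur.rowcount == 0:        # first trial containing this word
            conn.execute('INSERT INTO df VALUES (?, 1)', (word,))
    conn.commit()

add_document(['cellar', 'prisoner', 'cellar'])   # 'cellar' counted once
add_document(['prisoner', 'witness'])
print(conn.execute('SELECT count FROM df WHERE word = ?',
                   ('prisoner',)).fetchone()[0])
```

Looking up a single word's df is then one indexed query, which is what makes the tf-idf computation for each trial fast.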

Tuesday, June 17, 2008

At last we're in a position to actually train and test some machine learners. The one that we'll start with is called a naive bayesian. It is relatively simple to implement, although it usually doesn't perform nearly as well as fancier and more complicated learners. For our purposes, however, it has some real advantages, which we'll spell out eventually. The version of the naive bayesian learner that I am going to use is the one that was implemented by Toby Segaran in his book Programming Collective Intelligence. I won't post the code for the learner here, as it is already available online. If you are able to follow this series of posts and are interested in writing machine learning code in Python, Toby's book is a must-have. The only change that I have implemented is to remove stop words before submitting the trials for training or testing. You can get instructions and code for that from The Programming Historian.

Bayesian learners make use of a theorem proposed by Thomas Bayes and published in 1763, two years after his death (for more on Bayes, see Bellhouse's biography). The theorem states that Pr[H|E] = (Pr[E|H] * Pr[H]) / Pr[E]. Pr[H|E] is the probability that the hypothesis H is true, given some evidence E. Pr[E|H] is the probability that you would see evidence E if the hypothesis H were true. Pr[H] is the probability of the hypothesis and Pr[E] the probability of the evidence. Bayes theorem gives us a way of determining conditional probabilities: given that we know one thing, how likely is something else to be true?

Let's work through a simple example. Suppose bag A contains one black marble and three white ones, and bag B contains two white marbles and two black ones. Someone gives us a black marble but doesn't remember which bag they took it from. Given that you have a black marble, what are the chances that it came from bag A? In this case, Pr[H] is the probability the marble came from bag A. Since each bag contains the same number of marbles, Pr[H] = 4/8 = 1/2. Pr[E] is the probability that a marble is black, so Pr[E] = (1+2)/8 = 3/8. Pr[E|H] is the probability that you are going to get a black marble if you choose from bag A, in other words Pr[E|H] = 1/4. So Bayes theorem says that Pr[H|E] = (1/4*1/2) / 3/8 = 1/3. Since we know that the marble had to come from one of the two bags, that means that it should have a 2/3 chance of coming from bag B, which we can double check. Pr[notH|E] = (Pr[E|notH] * Pr[notH]) / Pr[E] = (2/4*1/2) / 3/8 = 2/3, as expected. You can learn more about Bayes theorem here.
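We can verify the marble arithmetic with exact fractions:

```python
from fractions import Fraction

pr_h = Fraction(4, 8)          # marble came from bag A (4 of 8 marbles)
pr_e = Fraction(3, 8)          # marble is black (1 in A + 2 in B, of 8)
pr_e_given_h = Fraction(1, 4)  # chance of black when drawing from bag A

# Bayes theorem: Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
pr_h_given_e = pr_e_given_h * pr_h / pr_e
print(pr_h_given_e)            # 1/3

# And the complementary check for bag B:
pr_noth_given_e = Fraction(2, 4) * Fraction(1, 2) / pr_e
print(pr_noth_given_e)         # 2/3
```

Using Fraction keeps the arithmetic exact, so the two probabilities visibly sum to one.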

When applied to the problem of learning, Bayes theorem looks like this: Pr[category|document] is proportional to Pr[document|category] * Pr[category]. (We don't need to divide by Pr[document] because doing so would scale all of our results by the same amount.) We make the (incorrect) assumption that the probability of each word in the document is independent of the others, so we can set Pr[document|category] equal to Pr[word1|category] * Pr[word2|category] * ... Finally, Pr[category] is simply the proportion of all documents that belong to our category of interest.
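Here is a minimal sketch of that calculation. This is not Segaran's implementation, just the independence assumption in code, with log probabilities (to avoid numerical underflow) and add-one smoothing (to avoid multiplying by zero for unseen words, a detail the formula above glosses over):

```python
import math
from collections import defaultdict

class TinyNaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.doc_counts = defaultdict(int)
        self.vocab = set()

    def train(self, words, category):
        self.doc_counts[category] += 1
        for w in words:
            self.word_counts[category][w] += 1
            self.vocab.add(w)

    def score(self, words, category):
        total_docs = sum(self.doc_counts.values())
        # log Pr[category]
        log_p = math.log(self.doc_counts[category] / total_docs)
        total_words = sum(self.word_counts[category].values())
        for w in words:
            # log Pr[word|category], summed as if words were independent
            count = self.word_counts[category][w] + 1   # add-one smoothing
            log_p += math.log(count / (total_words + len(self.vocab)))
        return log_p

    def classify(self, words):
        return max(self.doc_counts, key=lambda c: self.score(words, c))

nb = TinyNaiveBayes()
nb.train(['stole', 'watch'], 'theft-simplelarceny')
nb.train(['assault', 'beat'], 'breakingpeace-assault')
print(nb.classify(['stole', 'watch']))
```

Summing logs instead of multiplying raw probabilities gives the same ranking of categories without the products vanishing to zero on long documents.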

So how well does the naive bayesian learner do? Not very well. In a tenfold cross-validation run testing for cases of simple larceny in the 1830s it has an average error rate of 39.17%, compared with COINFLIP's average error rate of 49.39%. The error rate is simply (Misses + False Positives) / Total Number of Trials. Part of the problem is that we made the assumption that the probability of any word in a document is independent of the probability of any other word in the same document. We know this isn't strictly true. In the Old Bailey proceedings, for example, you find both "dwelling" and "dwelling house", as well as "victualling house", "sessions house", "station house", "house keeper" and many other forms. To the extent that these and other words tend to co-occur, the word probabilities can't be independent. We can improve the performance of our naive bayesian learner by using pairs of words (i.e., 2-grams) rather than individual words as features for the learner. This drops the error rate to 26.23% when categorizing trials for simple larceny in the 1830s. The code that tests the different learners is here. A graph of performance is shown below.

Friday, June 13, 2008

Now that we have our training and testing samples, we will be able to estimate the error rates of our various machine learners. Some of them won't be very good, especially if they are trained on relatively small or unrepresentative samples. None of them will be perfect, or even approach human performance. So it is usually a good idea to ask if the performance of a given learner is significantly different from chance. Consider three other abstract machines which don't do any learning at all.

YES is a very simple machine. When given an item and asked whether or not it is an instance of a particular category, YES says "yes". That's it. Suppose we have 100 test items and all of them are instances of our category, say 100 examples of burglary. We ask YES about each of them and it 'decides' that each is a burglary. YES makes no errors at all on this test sample! If half of the test items are not burglaries, however, YES's error rate climbs to 50%.

NO is also a very simple machine, responding "no" whenever tested. If we give it 100 examples of burglaries, it will fail to recognize every single one of them, with an error rate of 100%. The fewer burglaries our test sample contains, the better NO does.

COINFLIP is more sophisticated than YES or NO. Every time we ask COINFLIP to make a decision, it has a 50% chance of responding "yes" and a 50% chance of responding "no". Given a sample with 100 examples of burglaries, COINFLIP gets it wrong about half the time. Given a sample with no burglaries in it, COINFLIP will also have an error rate around 50%.
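The three machines are simple enough to sketch as one-line functions, with a quick check of YES's error rate on a half-and-half sample:

```python
import random

def yes_machine(item):
    return True            # always says "yes"

def no_machine(item):
    return False           # always says "no"

def coinflip_machine(item):
    return random.random() < 0.5   # "yes" half the time

# YES on a sample that is half burglaries: error rate is 50%
sample = [True] * 50 + [False] * 50   # True = actually a burglary
errors = sum(1 for actual in sample if yes_machine(None) != actual)
print(errors / len(sample))           # 0.5
```

Swapping in no_machine on the same sample also gives 0.5, and coinflip_machine hovers around 0.5, which is why these non-learners make useful baselines.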

With these three simple machines, we can be clearer about what it means to be right or wrong, distinguishing four categories:

Hit. If the machine says "yes" and the right answer is "yes", we say that it has scored a hit. This is one kind of correct answer. Both YES and COINFLIP are capable of scoring hits, but NO never is, because it can never say "yes" to anything.

False Positive. If the machine says "yes" but the answer is really "no", we say that it has responded with a false positive, which is one kind of incorrect answer. YES and COINFLIP can reply with false positives, but NO cannot.

Miss. If the machine says "no" but the correct answer was "yes", we say that it missed. NO and COINFLIP can miss, but YES cannot, because it never says "no".

Correct Negative. This happens when the machine says "no" and the correct answer was "no". NO and COINFLIP can reply with correct negatives, but YES cannot.
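A small helper can map each decision to one of the four outcome categories:

```python
def score(machine_says_yes, really_is_yes):
    """Classify a single yes/no decision against the true answer."""
    if machine_says_yes and really_is_yes:
        return 'hit'
    if machine_says_yes and not really_is_yes:
        return 'false positive'
    if not machine_says_yes and really_is_yes:
        return 'miss'
    return 'correct negative'

print(score(True, False))   # false positive
```

Running every test decision through a function like this is what produces the counts behind the error rates in earlier posts.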

We expect our learners to produce answers in each of the four categories. A machine that always hits will also tend to identify a lot of false positives. This can be good if you are looking for a needle in a haystack, but will overwhelm you if your category is well-attested. A machine that always identifies correct negatives will often miss things. These kinds of machines tend to be more useful when you would never have time to go through all of your items by hand. Most machine learners have parameters that allow you to tune their performance between these extremes.

Thursday, June 12, 2008

With most of our support routines in place, we need to think about the problem of training a machine learner and then assessing its performance. A human being has already gone through each of the trials and assigned one or more offence categories to it.

So we can give each raw trial to our learner and ask it to decide what offence category the trial belongs to, then we can check our learner's answer against the human-assigned category. If we do enough of these trials, we can get a precise sense of how good our learner is.

Most machine learning researchers use a holdout method to test the performance of their learning algorithms. They use part of the data to train the system, then test its performance on the remaining part, the part that wasn't used for training. Items are randomly assigned to either the training or the testing pile, with the further stipulation that both piles should have the same distribution of examples. Since burglaries made up about 2.153% (279/12959) of the trials in the 1830s, we want burglaries to make up about two percent of the training data and about two percent of the test data. It would do us no good for all of the burglaries to end up in one pile or the other.

But how do we know whether the results that we're seeing are some kind of fluke? We use cross-validation. We randomly divide our data into a number of piles (usually 10), making sure that the category that we are interested in is uniformly distributed across those piles. Now, we set aside the first pile and use the other nine piles to train our learner. We then test it on the first pile and record its performance. We then set aside the second pile for testing, and use the other nine piles for training a new learner. And so on, until each item has been used both for testing and for training. We can then average the ten error estimates. There are many other methods in the literature, of course, but this one is fairly standard.
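One way to build such piles, sketched under the assumption that we stratify by shuffling positives and negatives separately and then dealing them out round-robin (the linked code may do this differently):

```python
import random

def tenfold_piles(trial_ids, is_in_category, folds=10):
    """Split trial_ids into `folds` piles so that trials in the
    category of interest are spread evenly across the piles."""
    positives = [t for t in trial_ids if is_in_category(t)]
    negatives = [t for t in trial_ids if not is_in_category(t)]
    random.shuffle(positives)
    random.shuffle(negatives)
    piles = [[] for _ in range(folds)]
    # Dealing positives first guarantees each pile gets its share
    for i, t in enumerate(positives + negatives):
        piles[i % folds].append(t)
    return piles
```

With 100 trials of which 10 are positives, each of the ten piles ends up with exactly one positive, so no fold is tested on a sample empty of the category.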

Code to create a tenfold cross-validation sample from our data is here. As a check, we'd also like to make sure that our offence category is reasonably distributed across our sample (code for that is here).

Monday, June 09, 2008

With raw text files for each of the trials, we're almost in a position to try doing some experiments with a machine learner. Before we get started we are going to need a few utility routines to make our lives easier. Programmers enjoy writing tools so much they have a special expression for the process: yak shaving. Sometimes it's necessary, sometimes it's just fun, sometimes it's a great way to procrastinate. We'll try to keep it in check.

First of all, we'll want lists of all of the files that need to be processed in a given decade. We could use the operating system for this, but Windows is pretty slow when you have tens of thousands of files in a directory. A program that grabs the list of filenames and writes it to a text file is here.

We're also going to want a list of all of the dates on which trials occurred (in other words, we will want a list of all of the days that the court was in session). The program to generate that list and sort it in ascending order is here.

Since our initial experiments will be focused on trying to automatically categorize trials by offence (e.g., "burglary"), we are going to need a few routines that make it easier to work with offences. One of these needs to return a mapping from trial IDs to one or more categories of offence (the code is here):

t-18341124-1.txt -> theft-burglary

t-18341124-2.txt -> theft-burglary

t-18341124-3.txt -> breakingpeace-wounding

...

t-18341124-37.txt -> theft-stealingfrommaster, theft-simplelarceny

...

Another routine needs to return a mapping from a particular offence to a list of matching trial IDs (the code is here):

Saturday, June 07, 2008

At this point, we have a directory that contains one file for each trial in a given decade. There are a lot of these files (almost 13,000 for the 1830s alone) and each trial is still marked up with XML. In the next step we're going to create parallel directories that contain trial files with all of the XML stripped out. In other words, our trial files currently look like this:

1 john holgate and james holgate were indicted for that they on the 1st of october at st mary magdalen bermondsey about four o clock in the night the dwelling house of john thompson feloniously and burglariously did break and enter...

It may seem a bit perverse to take out information that the OB team worked very hard to create, so it is probably a good idea to step back and get a broader overview of the data mining process. When writing programs to manipulate digital sources you can head down one of two paths. You can choose to explicitly encode more and more semantic (i.e., meaningful) information. This is what the OB team has done with XML markup. By using <geo>...</geo> tags to indicate that "St. Mary Magdalen, Bermondsey" is a geographical location, they are able to provide a powerful search engine that can find places. Similarly, by indicating the age and sex of criminals and victims they make it possible for researchers to do a variety of sophisticated statistics on the archive as a whole. The downside, of course, is that this kind of explicit tagging is very labor-intensive. It is wonderful to be able to work with a digital archive that someone else has edited and marked up, but often you face a corpus of documents that is little better than raw text, or worse, that contains a high percentage of OCR errors.

An alternative approach is to work with domain-neutral representations and algorithms. You write programs that can't tell the difference between a person's name and a place name, between English and French, or between natural language and a genomic sequence. This is closer to what traditional search engines like Google do. The downside is that you can search for text that includes the string "Bermondsey" but you can't tell Google to limit your search to geographic uses of the term. Instead your results include the neighborhood, the tube station, a local history group, a diving club, a biography, a hymn, some photos, and so on.

Having access to text that has been semantically marked up makes it possible to create and test a wide range of powerful tools that can then be used on raw text that hasn't been marked up. For example, we know that this particular trial was for a burglary that ended with the execution of two men. Suppose we want to create a computer program which can classify trials as either "burglary" or "not a burglary." We start by creating an archive of raw text examples by stripping out the markup. We give these texts to our program, one by one, and tell it whether each text was an instance of "burglary" or not. With any luck, the program learns, somehow, to distinguish burglaries from other kinds of trial. (The details will be filled in as we go along). Now, we can test the program on other examples from this archive and get a precise sense of how well it does. If it seems to work, we can then use it to try and ferret out burglaries from other collections of untagged trial records, or even from a mass of undifferentiated search results.

So, to create a clean copy of each of the trial files we're going to use a very brute force method. We will simply copy the file, one character at a time, skipping any characters that fall between < and > inclusive. The Python code which does this job is here.
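The character-by-character approach looks like this:

```python
def strip_tags(text):
    """Copy text one character at a time, skipping everything
    between '<' and '>' inclusive."""
    out = []
    inside_tag = False
    for ch in text:
        if ch == '<':
            inside_tag = True
        elif ch == '>':
            inside_tag = False
        elif not inside_tag:
            out.append(ch)
    return ''.join(out)

print(strip_tags('<p>john holgate and <geo>bermondsey</geo></p>'))
# john holgate and bermondsey
```

Brute force, but it never needs to understand the markup, which is exactly the point of moving to a domain-neutral representation.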

Thursday, June 05, 2008

After downloading the XML-tagged files for the nineteenth century to our local machine, we ended up with a directory tree that looks like this:

Tagged_final (944 files, 9 folders)

Tagged_1830s_Files (62 files)

T18341124NW_SUP_DONE.xml

T18341205PH.xml

...

T18391216CLR.xml

Tagged_1840s_Files (120 files)

...

Tagged_1910s_Files (41 files)

T19100111GS_SUP_DONE.xml

T19100208GS.xml

...

T19130401CLR.xml

Each XML file contains all of the trials that were conducted in a particular session. The file 'T18341124NW_SUP_DONE.xml', for example, is the record for 24 Nov 1834. I'm assuming that the string that follows the date in the filename ('NW_SUP_DONE') refers to the encoding process, so I'm going to ignore it.

The next step is to split each of these XML files into individual trials. Our overall strategy will be as follows. First we want to create a directory for the trial files if one doesn't already exist. Then we will get a list of all of the XML files for the decade and step through them one at a time. For each XML file, we're going to extract each trial and save it as a separate file. Since a given trial is delimited with tags that look like <trial id="t-18341124-1" n="1"> ... </trial>, we can parse it out and save it separately as 't-18341124-1.txt'. You can read this trial online at the Old Bailey archives. You can also have a look at the XML file to see what we're dealing with. The fact that the OB team provides XML makes this archive an awesome resource for digital historians, and other online sites should do the same forthwith.

There are a variety of ways to parse XML, but it is quick and easy to use the Beautiful Soup library for Python. The program that splits the XML files into separate trial files is here; for more information about using Beautiful Soup see The Programming Historian. There are far more trial files than session files: there were 12,959 trials in the 1830s alone. Now that we have one file for each trial, we're ready for the next step.
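For readers without Beautiful Soup, the same split can be roughed out with a regular expression, given that trials are delimited as described above. This is a sketch, not the linked program, and regexes are generally a fragile substitute for a real XML parser:

```python
import re

# Matches <trial id="..."> ... </trial> and captures the id
TRIAL = re.compile(r'<trial id="([^"]+)"[^>]*>(.*?)</trial>', re.DOTALL)

def split_trials(session_xml):
    """Return (trial_id, trial_markup) pairs from one session file."""
    return [(m.group(1), m.group(0)) for m in TRIAL.finditer(session_xml)]

session = ('<sessionspaper><trial id="t-18341124-1" n="1">first</trial>'
           '<trial id="t-18341124-2" n="2">second</trial></sessionspaper>')
print([tid for tid, _ in split_trials(session)])
# ['t-18341124-1', 't-18341124-2']
```

Each pair can then be written out as a separate file named after its id, e.g. 't-18341124-1.txt'.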
