Automatic fogbugz triage with naive bayes

At my work we use fogbugz for our bugtracker and over the history of our company’s lifetime we have tens of thousands of cases. I was thinking recently that this is an interesting repository of historical data and I wanted to see what I could do with it. What if I was able to predict, to some degree of acuracy, who the case would be assigned to based soley on the case title? What about area? Or priority? Being able to predict who a case gets assigned to could alleviate a big time burden on the bug triager.

Thankfully, I’m reading “Machine Learning In Action” and came across the naive bayes classifier, which seemed a good fit for me to use to try and categorize cases based on their titles. Naive bayes is most famously used as part of spam filtering algorithms. The general idea is you train the classifier with some known documents to seed the algorithm. Once you have a trained data set you can run new documents through it to see what they classify as (spam or not spam).

For those who’ve never used Fogbugz, let me illustrate the data that’s available to me. I’ve highlighted a few areas I’m going to use. The title is what we’re going to use as the prediction value (highlighted blue), and the other red highlights are categories I want to predict (area, priority, and who the case is assigned to).

For the impatient, full source code of my bayes classifier is available on my github.

Conditional probability

Conditional probability describes the probability of an item given you already know something about it. Formally it’s described in the syntax of P(A | B), which is pronounced as “probability of A given B”. A good example is provided for in the machine learning book. Imagine you have 7 marbles. 3 white marbles, and 4 black marbles. Whats the probability of a white marble? It’s 3/7. How about a black marble? It’s 4/7.

Now imagine you introduce two buckets: a blue bucket and a red bucket. In the red bucket, you have 2 white marbles and 2 black marbles. In the blue bucket you have 1 white marble and 2 black marbles. Whats the probability of getting a white marble from the blue bucket? It’s 1/3. There is only one white marble in the blue bucket, and 3 total marbles. So, P(white marble | blue bucket) is 1/3.

Bayes Formula

This doesn’t really help though. What you really want is to be able to calculate P(red bucket | white marble). This is where bayes rule comes into play:

This formula describes how items and their conditions relate (marbles and buckets).

Conditional Independence

Naive bayes is called naive because it assumes that each occurrence of an item is just as likely as any other. Getting a white marble isn’t dependent on first getting a black marble. To put it another way, the word “delicious” is just as likely to be next to “sandwich” as it is to “stupid”. It’s not really the case. In reality “delicious” is much more likely to be next to “sandwich” than “stupid”.

The naive portion is important to note, because it allows us to use the following property of conditionally independent data:

What this formula means is that the probability of one thing AND another thing is the probability of each multiplied together. This applies to us since if the text is composed of words, and words are conditionally independent, then we can use the above property to determine the probability of text. In other words, you can expand P(text | spam) to be

The probability of the entire text is the probability of each word multiplied together.

Naive Bayes Algorithm Training

The goal of training the classifier is to find out what is the probability of a word in a particular class. If we’re going to classify the assigned user based on case text, then we need to know what is the probability of a word when the assigned user is “anton kropp“, and what is the probability when it is “other developer“.

To train the classifier, you need to have some documents whose class you know. For my example, I need to have a bunch of cases already made who are properly assigned. Using the training documents you first need to find out what are all the available words in the space. For this I tokenized the case titles into words without any special tokens (?, !, -, , etc) and I removed commmon stop words. The word space is the unique set of all the words in the training documents.

The next idea is that you treat this word space as a vector. If I am training the data on two cases with titles “this is broken” and “users don’t work“, then the vector will be:

[this | is | broken | users | dont | work]

Where the word “this” is in the 0 position, “is” is in position 1, etc. Next what we need to do is correlate each document’s word occurrence to the training word space vector. If a word exists it’ll get incremented in the right position. For example, with title “this is broken“, it’s vector will look like

[1 1 1 0 0 0]

Indicating it had 1 occurrence of “this“, 1 of “is“, and 1 of “broken“, but zero occurrences of “users“, “dont” and “work“. The second title will have

[0 0 0 1 1 1]

Next we need to do is go through every class (the assigned users, for example, such as “anton kropp“, and “other developer“) and divide the count of each word by the total number of words found. This gives you the probability that a certain word occurs in a certain class.

Work through an example

So lets say I have a document class that represents what I’m using to train my data. The document class is generic, it doesn’t really care what it’s representing.

Now to train the data. We group the documents by class, then create a vector representing the occurance of each word in the space for all the documents in that class. I’m initializing the first vector to use all 1’s instead of 0’s since we are going to multiply the probability of each word together. If any probability is zero (i.e a word wasn’t found), then we’ll get a zero for the probability, which isn’t really true.

You may notice the logarithm stuff going on. That’s to prevent numeric underflow. Since probabilities will be small decimals, multiplying them all will make them smaller. By taking the logarithm we can maintain the same relative shape of the function, but they get shifted into a different number space. Now multiplying them won’t give us underflow.

Classifying

To classify, we need to multiply the vocab vector of a document by each classes probability vector and add the probabilities up. Whatever class gives us the highest probability is the class we predict we are

Success rate

The next step is to test the success rate. If we have a bunch of documents that are already categorized we can run them back through the classifier. At that point all that’s required is to see if the classifier classifies a document properly or not. The success rate is important, because if we have a low success rate of classifying documents then either the training set was poor, or the classifier needs other tuning. Without testing the success rate you can’t know how accurate your predictions would be.

Back to fogubgz

Remember, I want to see if I can auto triage cases. To do that I need to train the bayes classifier with some known data. Thankfully, I have over 50,000 cases to use as a my training set. Fogbugz makes it easy to pull data back since they have an exposed web api you can hook into that returns back neatly formatted XML. First log in:

This pulls back 1500 cases that I opened with the colums of title, project, area, assigned to, priority, all the edit events, who opened it, and who resolved it. I chose 1500 only to not have to wait to pull back all 50,000+ cases (though I probably could have). Also I am limiting my searches to individual users who have opened cases. By doing this I may be able to get some insight in what kind of cases people make. Maybe cases I make are more likely to be assigned to myself. By limiting my training set to a specific category like this I’m inherently creating a pivot.

After parsing the xml, all you need to do is transform your fogbugz case to a generic document (that the naive bayes classifier I wrote wants). ToDoc lets me adjust what is the class (in this scenario who the case is assigned to) and what is the predictor text (the case title). Generalizing the transform lets me reapply the same runner for different combinations of input.

Conclusion

While I can’t publish our internal bug information, I found that the classifier ranged in success from as low as 30% to as high as 80% depending on which user I pulled data back from. Personally, cases I opened had a 72.27% success rate classifying their assignment based soley on case title. That’s not too bad! It’d be neat to tie this classifier into a fogbugz plugin and generate a first pass auto triage report for our triager to use. When they’ve reassigned cases we can re-train the set periodically and hopefully get better results.

Full source including bayes classifier and fogbugz parsing available here.