The company I work for sends a lot of emails, and in return we get a lot of bounces.
We currently don't have a good way of sorting them so that the routine ones are archived while the important ones are put in front of human eyes.
From a formatting point of view these email bounces are not well structured, but there are patterns of a sort.
There is a preliminary system being worked on, but it is becoming an unmanageable collection of complicated regular expressions.
I convinced my company to let me take home a portion of the bounces, and I am hoping to find a better way to handle them in my free time. It's an opportunity both to learn something interesting and to do something valuable.
How would you approach this problem? Is there a class of methods or algorithms that are a good fit for this situation?

Can you expand on what you are doing now? Do you embed any tracking codes into the email?
– GrandmasterB Mar 17 '12 at 6:44

The emails are sent with unique identifiers and tracking codes, but I don't have access to the send report or the collected data. Currently, the bounces are processed by a system written by another engineer. The system is based on regular expressions, but I feel the approach is a dead end: there are several hundred regular expressions, and you never know the impact of changing any one of them.
– Pierre Mar 17 '12 at 16:06

1 Answer

Bayesian Classifier: Since you have a large set of data where you know the "correct" outcome (i.e., archived or seen by humans), you can use it to provide initial training for the classifier. As a test, you can run each training item back through the classifier and see whether it's classified properly, then run a batch of unknown items and see how the classifier does on them.

Once you're feeding humans the classified items, it's important to adjust the system they use so there's a mechanism to provide feedback when the classifier made the wrong decision:

Seen by a human but should have been archived

Archived but should have been seen by a human (Use random spot checks of the archive to achieve this if things are never fished out of the archive to be human-processed.)

...or when it made the right decision (these should be implicit if you get no feedback on an item):

Seen by a human and should have been seen by a human

Archived and should have been archived.

Bayes classifiers need to be trained with both kinds of information, and get pretty good at being right as positive and negative examples pile up.
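To make the idea concrete, here is a minimal multinomial naive Bayes classifier written from scratch with only the Python standard library. The "archive"/"review" labels and the sample bounce texts are invented for illustration; a real system would train on the labeled bounce corpus described above.

```python
import math
from collections import Counter

class BounceClassifier:
    """Minimal multinomial naive Bayes with Laplace smoothing."""

    def __init__(self):
        self.word_counts = {"archive": Counter(), "review": Counter()}
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.word_counts[label].update(tokens)
        self.label_counts[label] += 1
        self.vocab.update(tokens)

    def prob_review(self, text):
        """Posterior probability that a bounce needs human review."""
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        log_post = {}
        for label in ("archive", "review"):
            lp = math.log(self.label_counts[label] / total)  # log prior
            n = sum(self.word_counts[label].values())
            v = len(self.vocab)
            for tok in tokens:
                # Laplace-smoothed log likelihood of each token
                lp += math.log((self.word_counts[label][tok] + 1) / (n + v))
            log_post[label] = lp
        m = max(log_post.values())  # normalize via log-sum-exp
        exp = {k: math.exp(s - m) for k, s in log_post.items()}
        return exp["review"] / sum(exp.values())

clf = BounceClassifier()
clf.train("550 user unknown mailbox not found", "archive")
clf.train("552 mailbox full quota exceeded", "archive")
clf.train("554 message rejected as spam by policy", "review")
p = clf.prob_review("message rejected spam")
```

The feedback loop described above just becomes more calls to `train()` with human-corrected labels.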

Scoring: This is probably what you're doing now with regular expressions and is where you integrate all of the human expertise that a classifier can't handle. Each item starts with a score of zero ("neutral") and each matched rule pulls it in a positive or negative direction depending on whether the rule means the item should be archived or seen. One of those rules should adjust the score based on what the classifier returns, applying a negative score for probabilities in [0.0, 0.5), zero for 0.5, and a positive score for values in (0.5, 1.0]. Once you get a handle on how well the classifier does, you can adjust the magnitude of the score adjustment based on where in those ranges the probability falls. Another thing you can do is lower the threshold for being seen, so a bigger range of probabilities around the center (the "not sure" range) gets screened by humans; the feedback you collect there will make the classifier better at deciding items that fall outside that range.
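A rough sketch of that scoring arrangement follows. The rules, weights, and threshold are invented for illustration and don't reflect any real rule set; the classifier probability is folded in as just another rule.

```python
# Positive scores push toward "seen by a human", negative toward "archive".

def classifier_rule(prob_review, weight=3.0):
    """Map a classifier probability in [0, 1] to a signed score:
    below 0.5 scores negative (archive), above 0.5 positive (review)."""
    return (prob_review - 0.5) * 2 * weight

def score_bounce(text, rules, prob_review):
    score = 0.0  # every item starts neutral
    for rule_score, predicate in rules:
        if predicate(text):
            score += rule_score
    score += classifier_rule(prob_review)
    return score

# Hypothetical rules as (score, predicate) pairs.
RULES = [
    (-2.0, lambda t: "user unknown" in t),  # routine hard bounce -> archive
    (+4.0, lambda t: "blacklist" in t),     # deliverability problem -> review
]

# Items in the "not sure" band around zero go to humans so their
# decisions become feedback for the classifier.
NOT_SURE = 1.0

def route(score):
    if score < -NOT_SURE:
        return "archive"
    return "review"  # confident positives and uncertain items alike

s = score_bounce("550 user unknown", RULES, prob_review=0.2)
```

Widening `NOT_SURE` is the "lower the threshold" knob: it trades more human effort now for better classifier decisions later.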

Implementation: The good news is that you don't have to develop all of this yourself. Since you're working with email, SpamAssassin can do nearly all of the grunt work for you and is ripe for adaptation to fit your application. (With some clever re-packaging of the data, you can use it for applications that don't involve email, too.) All you need to do is tear out all of the built-in rules and substitute your own set. One of SpamAssassin's other handy features is that it can add reporting on what it matched directly to the headers of each item. That gives you the ability to use the feedback you collect to find what rules are most often involved in misclassifications and adjust them accordingly.
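For instance, SpamAssassin reports the rules that fired in a `tests=` list inside its X-Spam-Status header, so a small script can tally rule hits against the human feedback. The header values and rule names below are made up for illustration, and real headers may be folded across lines, which this sketch doesn't handle:

```python
import re
from collections import Counter

def rules_from_header(header_value):
    """Pull the rule names out of an X-Spam-Status header value."""
    m = re.search(r"tests=([A-Z0-9_,\s]+)", header_value)
    if not m:
        return []
    return [r.strip() for r in m.group(1).split(",") if r.strip()]

def misfire_counts(misclassified_headers):
    """Count how often each rule fired on items humans flagged as wrong."""
    counts = Counter()
    for hv in misclassified_headers:
        counts.update(rules_from_header(hv))
    return counts

headers = [
    "Yes, score=6.2 required=5.0 tests=MAILBOX_FULL,QUOTA_HINT",
    "Yes, score=5.1 required=5.0 tests=MAILBOX_FULL",
]
top = misfire_counts(headers).most_common(1)[0]
```

Rules that top this tally are the first candidates for rewording or rescoring.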

Side note: For applications that require classifying things n ways, run each bit of input through multiple filters and pick the one that returns the highest score.
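That side note amounts to an argmax over per-category scorers. The categories and scoring functions below are invented stand-ins for whatever filters you'd actually build:

```python
def classify(item, scorers):
    """scorers: mapping of category name -> function(item) -> score.
    Returns the category whose scorer gives the highest score."""
    return max(scorers, key=lambda cat: scorers[cat](item))

scorers = {
    "hard_bounce": lambda t: 2.0 if "user unknown" in t else 0.0,
    "soft_bounce": lambda t: 2.0 if "mailbox full" in t else 0.0,
    "block":       lambda t: 3.0 if "blacklist" in t else 0.0,
}
label = classify("552 mailbox full", scorers)
```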