Email Classification: The Road to Production

By Jack Lawton

In my previous blog, I introduced our latest project – an email classification system for a large UK utilities supplier – and laid down a few basic principles that text classification solutions often rely on, such as structured and unstructured data. If you’re not up to speed with the story thus far, I recommend you check part one out now!

Now I want to explore our email classification model in much greater depth. While I’m at it, I’ll also show you how bringing a model like this into production can sometimes be more difficult than developing it…

What problem were we facing?

Originally, every single customer service email to our client was being sent to the same email address, and hence arriving in the same Outlook inbox. For a smaller company, this wouldn’t have been an issue – but for a large business like our client’s, we were dealing with about 30,000 emails per day.

Every single one of these emails had to be manually checked and forwarded to the appropriate inbox, so that it could be answered by a specialised team. This, of course, was a huge waste of time and human resource. It also meant response teams had to work different hours, to give human classifiers enough time to separate their emails. If a high priority email came at a bad time, it was unlikely anyone would be available to respond to it in a timely manner.

Although there were many topics, we noticed that emails could be classified into one of nine broad categories. Predicting for just these nine categories met the requirements of the business and enabled us to push for higher levels of accuracy. I therefore started to develop a classification solution with ten buckets: one for each of the top nine topics, plus one for “Other” topics.

Back to reality

When building a machine learning solution, it naturally makes sense to optimise it so that it predicts the correct result as often as possible. However, in the real world, not all emails are made equal: it was critical for me to ensure as few complaints as possible were misclassified.

Equally, in the real world, it is never possible to achieve 100% accuracy, but there are ways to get closer for critical categories. For this project, we boosted complaint accuracy by adjusting the certainty threshold (the confidence level the model must reach before it tags an email as a complaint). However, this has to be done carefully so that the rest of the model does not suffer too much:

In the above graph, the x-axis represents the certainty of an email being a complaint. The y-axis shows the percentage accuracy for the three lines:

Complaints Identified – The percentage of complaints correctly identified. This must be as high as possible.

Other Categories Identified – The percentage of all other categories identified correctly. Whilst this is not as business critical, we don’t want it to drop too low either.

True Complaints in Bucket – The percentage of emails in the complaints bucket that really are complaints, which falls as more emails are falsely identified as complaints. Complaints statistics are used for regulatory purposes, so this must be as high as possible.

Naturally, if we accepted that everything over 0% was a complaint, we would catch 100% of the complaints, but 0% of everything else. Getting this right is a delicate balancing act and we can never have the best of both worlds.
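To make the three measures concrete, here is a minimal Python sketch that computes them for a single threshold. It is an illustration, not the production code: the function name, the toy scores and the labels are all made up, and "other categories identified" is simplified to "non-complaints kept out of the complaints bucket".

```python
def threshold_metrics(scores, is_complaint, threshold):
    """Return the three percentages for one complaint threshold.

    scores       -- model certainty (0.0 to 1.0) that each email is a complaint
    is_complaint -- true labels (True if the email really is a complaint)
    threshold    -- scores at or above this go into the complaints bucket
    """
    flagged = [s >= threshold for s in scores]

    complaints = sum(is_complaint)
    others = len(is_complaint) - complaints
    in_bucket = sum(flagged)

    # Complaints that landed in the complaints bucket.
    caught = sum(f and c for f, c in zip(flagged, is_complaint))
    # Non-complaints that stayed out of the complaints bucket (simplification).
    others_kept = sum((not f) and (not c) for f, c in zip(flagged, is_complaint))

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "complaints_identified": pct(caught, complaints),
        "others_kept_out_of_bucket": pct(others_kept, others),
        "true_complaints_in_bucket": pct(caught, in_bucket),
    }


# Hypothetical toy data: three real complaints, three other emails.
toy_scores = [0.95, 0.60, 0.15, 0.40, 0.05, 0.02]
toy_labels = [True, True, True, False, False, False]

everything_flagged = threshold_metrics(toy_scores, toy_labels, 0.0)
balanced = threshold_metrics(toy_scores, toy_labels, 0.5)
```

With a threshold of 0, every complaint is caught but no other email escapes the complaints bucket, which is the extreme case described above. Sweeping the threshold from 0 to 1 and plotting the three values would give curves of the kind discussed here.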

I needed to plan for the inevitable margin of error. So how about a threshold of 10%? This seems smart, as it strikes a balance between accuracy on complaints and on everything else. However, only 25% of the emails in the complaints bucket would actually be complaints, so this result would be unacceptable. What if we set the threshold at 90%? This would result in an unacceptable number of complaints going undetected.

The compromise was to introduce a new category: “possible complaints”. Any email with a complaint certainty between 10% and 90% is flagged for human review. While this means we do not fully remove the human element, it means we can automate 95% of the work while getting much closer to 100% overall accuracy.
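The two-threshold compromise can be sketched in a few lines of Python. This is a minimal illustration rather than the production code: the function and constant names are hypothetical, and how the exact 10% and 90% boundaries are treated is an assumption.

```python
COMPLAINT = "Complaint"
POSSIBLE_COMPLAINT = "Possible complaint"


def route_email(complaint_score, predicted_topic, lower=0.10, upper=0.90):
    """Pick a bucket from the complaint certainty and the model's best other topic.

    complaint_score -- model certainty (0.0 to 1.0) that the email is a complaint
    predicted_topic -- the category to use when the email is not a complaint
    lower, upper    -- the 10% and 90% thresholds (boundary handling is assumed)
    """
    if complaint_score >= upper:
        return COMPLAINT            # confident enough to automate
    if complaint_score > lower:
        return POSSIBLE_COMPLAINT   # uncertain: send for human review
    return predicted_topic          # confidently not a complaint


# Hypothetical examples ("Billing" is an invented topic name).
confident = route_email(0.95, "Billing")
uncertain = route_email(0.50, "Billing")
not_complaint = route_email(0.03, "Billing")
```

Only the middle band reaches a human, which is how most of the work can be automated while the riskiest decisions still get a second pair of eyes.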

Putting a plan in action

If you have ever developed a machine learning solution for business, you will know that developing an accurate model is only half the battle. Bringing machine learning into production is a huge challenge of its own.

For one thing, we must consider the future of the model. Because patterns and trends in emails can change over time, the model was designed to regularly monitor its own performance and identify mistakes – a feedback loop in the spirit of reinforcement learning. That way, it can learn over time and stay on top of its game with a minimal amount of human involvement.
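As a rough illustration of what monitoring performance over time can look like, here is a minimal Python sketch of a rolling accuracy check. It is an assumption about one possible design, not a description of the production system: the class name, the window size and the accuracy floor are all hypothetical.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent human-checked predictions."""

    def __init__(self, window=1000, min_accuracy=0.90):
        # deque with maxlen drops the oldest result once the window is full.
        self.results = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, corrected):
        """Store whether the model's category matched the human's category."""
        self.results.append(predicted == corrected)

    def accuracy(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_retraining(self):
        """True once a full window of results falls below the accuracy floor."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy() < self.min_accuracy


# Hypothetical usage with a tiny window so the behaviour is easy to follow.
monitor = RollingAccuracyMonitor(window=4, min_accuracy=0.75)
monitor.record("Billing", "Billing")
monitor.record("Billing", "Complaint")
early_flag = monitor.needs_retraining()   # window not yet full
monitor.record("Moving home", "Moving home")
monitor.record("Other", "Billing")
late_flag = monitor.needs_retraining()    # 2 of 4 correct, below the floor
```

Waiting for a full window before raising the flag avoids reacting to a handful of early mistakes.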

To support this reinforcement learning approach, one challenge was how to continually log inbound emails and model performance in a live system. To do this, I chose Elasticsearch as my data store. Elasticsearch killed two birds with one stone. As it is a search engine at heart, it is the perfect environment to deal with free-text fields (handy when it comes to emails). It is also ideal for server logging and performance statistics.
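For a sense of what logging a prediction might involve, here is a minimal Python sketch. The field names, function names and index name are all hypothetical, since the actual schema is not described here. The sketch assumes a client object that exposes `index(index=..., document=...)`, as the official Elasticsearch Python client does in its 8.x releases (older releases take `body=` instead).

```python
from datetime import datetime, timezone


def build_log_document(subject, body, predicted, complaint_score, corrected=None):
    """Assemble one log record; every field name here is an assumption."""
    return {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "subject": subject,                 # free text, searchable
        "body": body,                       # free text, searchable
        "predicted_category": predicted,
        "complaint_score": complaint_score,
        "corrected_category": corrected,    # filled in later by a human, if at all
    }


def log_prediction(client, index_name, document):
    """Send the record to the data store via the client's index() method."""
    return client.index(index=index_name, document=document)


# Hypothetical example record.
example_doc = build_log_document(
    subject="Meter reading query",
    body="Hello, I would like to submit a reading...",
    predicted="Meter readings",
    complaint_score=0.04,
)
```

Keeping the record builder separate from the call that sends it makes the logging easy to test without a running cluster.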

Bringing several machine learning techniques together alongside Microsoft Exchange and Elasticsearch is no easy task. But, using the flexible Python programming language alongside careful planning and testing, I was able to bring it all together much more easily than anticipated.

Next, we required a dedicated server to run the service. While this may sound simple, in organisations with new data science teams, this can be a tricky one to get past IT security. You’ll need to consider every permission your server will need, which can add a lot of time at the end of a project. Fortunately, for email classification, it was relatively straightforward, as all we had to do was hook into Microsoft Exchange.

The next problem was how to build confidence in the solution. When solving problems that have depended on humans for so long, such as text classification, there is always a degree of scepticism within the business. To tackle this, we opted for a light roll-out: at first the system only tagged emails, which let the business build confidence before giving the green light to switch on the automated classification.

Emerald City

I view every project as a learning experience, and this one was no different. I hope that you find my discussion here useful when planning and implementing your own machine learning projects. Every single aspect of a solution, from the model itself, right through to the implementation is equally important to me. Because, at the end of the day, there is nothing more satisfying than seeing that first email moved, not by an analyst, but because a business has put their faith in your solution.