Adventures in AI part 3: How do I know my ML model is any good?

You can train any model you like for your AI solution, but it is only as good as the data you feed it. But how do you measure the performance of a machine learning model? In this third part of the series on AI models I will show you how.

Validation of machine learning models is different

Machine learning models are a different kind of beast to work with. I've built regular software solutions for quite some time now and found that rule-based solutions are much easier to validate.

Why is it so different to test a machine learning model, you ask? Well, a machine learning model is based on statistics, random factors and probabilistic approaches. That makes for a very volatile mixture. You can't look at an AI solution and pinpoint the exact problem the way you would with a regular software solution.

Some developers would argue that it is impossible to test machine learning models. I think that there are some approaches that will give you the proper performance metrics that you need to be able to use your machine learning model with confidence.

There are two ways in which you can test machine learning models. First you can build unit-tests. But unit-tests for AI solutions don't test the actual performance of the model that you're using. It can only test whether data flows through the model correctly and that the model doesn't raise any exceptions.

To measure whether the model is predicting the correct output, you need a different test approach. In fact, it is not so much a correctness test as a performance test.

By using the validation techniques in this post you will be able to answer the question: How far off are we with our predictions?

Measuring performance

There's not one method to measure the performance of your machine learning model. For each kind of model there's a different technique that will tell you how well your model performs.

I will talk you through some of the most used metrics there are in machine learning. These metrics work for regular machine learning as well as deep learning models.

Measuring performance of regression models

For regression models for example you need to measure the distance between the predicted value and the actual value that the model was trained to predict.

To do this you usually use a score called Mean Squared Error (MSE). This expresses the average squared difference between the predicted values and the actual values.

The larger the output of the Mean Square Error formula, the worse the model performs.

For example, if you're predicting the value of a house in dollars, MSE tells you how far off you are on average (squared, though, so the values will be high).
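In practice you'd probably use a library function such as scikit-learn's mean_squared_error, but a plain-Python sketch makes the formula concrete. The house prices below are made up for illustration:

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between prediction and truth."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical house prices in dollars: the model is off by
# $10,000 and $20,000 respectively.
actual = [200_000, 300_000]
predicted = [210_000, 280_000]

print(mean_squared_error(actual, predicted))  # 250000000.0
```

Note how squaring punishes the $20,000 miss four times as hard as the $10,000 miss; that is why a few outliers can dominate the score.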

Notice that you should use both R-squared and MSE with care. If there are many outliers in the input data, the scores will still be off by quite a big margin.

One shining example of where a model can go wrong is Anscombe's quartet: four datasets with nearly identical summary statistics that look completely different when plotted. Both performance metrics that I've shown above will not help you when you have a situation like that.

To get the best results possible, follow these three steps:

Check your data for outliers and drop them if you can

Visualize your data and check for signs that something's wrong.

Use the performance metrics I've shown you.

Measuring performance of classification models

To validate a classification model you need different performance metrics. There's no relative measure of correctness; a predicted label is either correct or it isn't.

To measure the performance of a classification model you need to use a confusion matrix.

|  | Actual positive | Actual negative |
| --- | --- | --- |
| Predicted positive | True positive | False positive |
| Predicted negative | False negative | True negative |

Using this table you can calculate the accuracy of your model:

$$
Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
$$
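As a sketch, here's how you could compute accuracy from the confusion-matrix counts in plain Python. The labels are made up, with 1 as the positive class and 0 as the negative class:

```python
def accuracy(actual, predicted):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN): the fraction of
    predictions that land on the diagonal of the confusion matrix."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return (tp + tn) / len(actual)

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(accuracy(actual, predicted))  # 4 correct out of 6
```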

Measuring multi-class classification models

Accuracy is only useful for measuring performance of a binary classifier. If you have multiple labels in your data that you want to predict you need more precise tools.

Accuracy still works, but is rather crude. It only measures the overall number of cases that you predicted correctly. For a more accurate score you need to take a look at how well the classifier does for each label individually.

The best way to measure performance of a multi-class classifier is to measure accuracy per label.
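One way to read "accuracy per label" is the fraction of samples of each class that the model gets right. A plain-Python sketch, with made-up animal labels:

```python
from collections import defaultdict

def accuracy_per_label(actual, predicted):
    """Fraction of samples of each class that were predicted correctly."""
    correct, total = defaultdict(int), defaultdict(int)
    for a, p in zip(actual, predicted):
        total[a] += 1
        if a == p:
            correct[a] += 1
    return {label: correct[label] / total[label] for label in total}

actual    = ["cat", "cat", "dog", "bird", "dog", "bird"]
predicted = ["cat", "dog", "dog", "bird", "dog", "cat"]
print(accuracy_per_label(actual, predicted))
# {'cat': 0.5, 'dog': 1.0, 'bird': 0.5}
```

A per-label breakdown like this quickly exposes a classifier that scores well overall but fails on one specific class.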

This is a good metric for classification models if you return just one result to the user of your model. In the case of recommendations this doesn't work.

Measuring performance of recommendation systems

Recommendations are a special case of classification model. You predict whether someone likes an item or not, but you typically show a number of recommended items instead of just one. This raises the question: how good is my model in this case?

There are two more measures important here. Precision and Recall.
Precision tells you how many of the items you show are actually relevant to the user.

$$
Precision = \frac{TP}{TP+FP}
$$

The second measure that is important for recommendations is recall: how many of the relevant items in the whole dataset are ever shown to the user? For example, if there are 10 books in the database that are proven relevant, how many of those books are shown to the user?

$$
Recall=\frac{TP}{FN+TP}
$$
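Both formulas can be sketched in a few lines of plain Python. The book scenario below is made up: 10 of 20 books are truly relevant, and the system recommends 5 of them plus 2 irrelevant ones:

```python
def precision_recall(actual, predicted):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

# 1 = relevant, 0 = not relevant; the first 7 predictions are "shown".
actual    = [1] * 10 + [0] * 10
predicted = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8
precision, recall = precision_recall(actual, predicted)
print(precision, recall)  # 5/7 of shown items are relevant; 5/10 relevant items shown
```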

You cannot maximize precision and recall at the same time; you have to make a tradeoff in your model. Do you want to show as many of the relevant items as possible? Then you have to increase the total number of items displayed, which decreases precision. Want to show only relevant items? Then you decrease the number of items shown, which decreases recall again.

A good way to visualize this behavior is the area under the curve (AUC) metric for the precision-recall curve. This shows you just how well the recommendation system is working. For more information, take a look at the Wikipedia article on the subject.

Accounting for false positives and false negatives

Looking at absolute performance figures for classifiers is fine in general, but it doesn't account for some nasty scenarios. For example: when you predict fraud cases you may want to minimize the number of false positives. Or, if you want to minimize the chance that you've missed fraud cases, you want to minimize the false negatives.

To account for this you need to make use of the F1 score. The basic formula for the F1 score is like this:

$$
F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}
$$

It takes both precision and recall into account. Remember, recall is the measure of how many true positives are actually picked up by the classifier; in this case, how many fraud cases it actually detects. Precision tells you how many of the cases that were presented as positive are true positives.

You can change this formula by including an additional weighting factor, beta. I call this the tuning factor; the resulting metric is commonly known as the F-beta score:

$$
F_\beta = (1+\beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}
$$

When you set beta to 1 you get the balanced F1 score back. Values below 1, such as 0.5, put the focus on precision; values above 1 put the focus on recall.
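As a sketch, the weighted score (commonly called the F-beta score) can be computed in plain Python; the precision and recall values below are made up:

```python
def fbeta(precision, recall, beta=1.0):
    """F-beta score: beta < 1 weights precision, beta > 1 weights recall,
    and beta = 1 reduces to the plain F1 score."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.8
print(fbeta(p, r))             # balanced F1 (harmonic mean)
print(fbeta(p, r, beta=0.5))   # leans toward precision (closer to 0.5)
print(fbeta(p, r, beta=2.0))   # leans toward recall (closer to 0.8)
```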

The train/validation split

Now that you've seen the metrics for measuring performance, let's talk about the famous train/validation split. A lot of people will tell you it is important to split your data in a training and validation set. But why should you?

To understand why you should split your input data we have to go back to how a model learns.

A machine learning model learns by finding rules or parameters that best fit the training data. For example, if you want to predict the price of a house you'd typically use a regression model that looks like this:

$$
price = w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3
$$

Here x1, x2 and x3 are input features, such as the size, location and age of the house. To get the best predictions possible you're going to need to find values for w1, w2 and w3 that fit the training data best.

There are two things that happen while you're trying to find the optimal weights in a model. First, the model tries to memorize the weights that best match the training data. Second, by learning the weights that fit the training data, it discovers rules that also work for similar but unseen data.

The memorization effect of a machine learning algorithm is rather unfortunate, as it reduces the chance that the model finds parameters that work for similar but unseen data. This is called overfitting.

You want to balance between memorization and generalization in your model.
So when you validate your model you will test for two things:

Does the model learn properly from the training data

Is the final model general enough to fit similar, but unseen data

For this you need to run the validation metrics on two sets of data. First, you use the training data to learn the parameters. You can then run the performance metrics on that same data to get an initial measurement; this tells you how well the model learns. Second, you need a separate set of data that is not used for training, to see whether the model generalizes to a more usable solution. This is the validation set.

As a general rule of thumb you want to split the data in 80% training data and 20% validation data.

To perform this split you need to randomly select samples for both sets. Why random? If you pick the top 80% of your training samples without using randomness, you may be looking at a set that doesn't cover the whole range of possible input values for your model to learn from. So you should use random selection for splitting the data.

But be careful when you split your data. When you have a classification model you want to split the data so that there's a balanced set of classes represented in both the training and the validation data. Remember that when you provide only a very limited set of samples for a single class, the model isn't going to be able to learn sufficiently good rules to classify data for that class.
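Libraries like scikit-learn offer this as a stratified split (for example the stratify parameter of train_test_split), but the idea fits in a few lines of plain Python. The spam/ham labels below are made up:

```python
import random

def stratified_split(labels, validation_fraction=0.2, seed=42):
    """Randomly split sample indices 80/20 while keeping the class
    proportions the same in both sets."""
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(labels):
        by_label.setdefault(label, []).append(i)
    train, validation = [], []
    # Shuffle and split each class separately, then combine.
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = int(len(indices) * validation_fraction)
        validation.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, validation

# Hypothetical dataset: 10 spam and 10 ham samples.
labels = ["spam"] * 10 + ["ham"] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 16 4
```

Because each class is split separately, both sets end up with the same spam/ham ratio as the full dataset.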

Model selection

Notice that, in contrast to common wisdom, I haven't talked about train/validation/test sets. You will encounter this trio of datasets quite often on the internet. Here's the reason.

Typically when you are working on a machine learning solution you are going through these three phases:

Prepare the data for machine learning

Train different models and select the model that works best for your data

Perform a final validation on the model to make sure that it works in production

I mentioned that you need to split your data into a training and a validation set. This split is required to detect problems with overfitting. When you train a single model on the dataset, that's enough; you don't need to set apart a third set of samples for testing.

However, if you're training multiple models because you're unsure which model will work, this third set of data becomes more important.

All models should be validated using the same samples to get a proper comparison of the performance of the different models you've trained.

To make this work, you first split your data into an 80% training set and a 20% test set. For each model you train after this, you take the 80% training set and randomly split it into 80% training and 20% validation data. The initial 20% test set stays fixed.
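A minimal sketch of this two-stage split in plain Python, using 100 hypothetical sample indices:

```python
import random

def split(indices, fraction, rng):
    """Shuffle a copy of the indices; `fraction` goes to the second part."""
    indices = list(indices)
    rng.shuffle(indices)
    cut = int(len(indices) * fraction)
    return indices[cut:], indices[:cut]

rng = random.Random(0)
all_idx = range(100)

# Fixed 20% test set, set aside once before any training happens.
trainval_idx, test_idx = split(all_idx, 0.2, rng)

# For each candidate model, re-split the remaining 80% into train/validation.
train_idx, val_idx = split(trainval_idx, 0.2, rng)

print(len(train_idx), len(val_idx), len(test_idx))  # 64 16 20
```

Every candidate model gets its own fresh train/validation split, but all of them are finally compared on the same untouched test set.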

Again, be careful how you split the data. Make sure that each class you want to predict is properly represented in every set.

Additionally, I would not randomly split the initial dataset but spend some time picking the right test samples. You want to make sure that there's a particularly good spread in those test samples. This is, after all, the final quality measure for your model. Get this wrong and you may be wondering why people complain about your AI solution.

Cross validation for even better metrics

The different validation techniques in this post will help you get started to validate your machine learning models. And the test/validation split will help you detect overfitting scenarios.

However, all of the performance measures are relative. The accuracy, precision, MSE and other metrics will vary across runs. That's because there's a random factor involved.

The random factor in machine learning comes from the very first step in the process of learning the rules for a model. Because you can't start with nothing you have to pick random values for the parameters in your model. This causes the model to go in a random direction when you perform the next step in the learning process.

Randomness isn't a big problem, because ultimately this is solved by providing enough training samples so a set of general rules emerge.

However for validating a model this means that you will see fluctuations in the performance.

Cross validation can counteract this effect a little. By performing multiple training/validation runs on a single type of model with different splits of the data, you generate enough metrics to compute a mean value for your model's performance. This evens out the random effect quite a bit.
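The usual form is k-fold cross validation, where each fold takes one turn as the validation set. Scikit-learn provides this as cross_val_score; the plain-Python sketch below uses a hypothetical scoring callback in place of a real model:

```python
import random
from statistics import mean

def k_fold_indices(n, k=5, seed=0):
    """Shuffle n sample indices and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, train_and_score, k=5):
    """Run k train/validation rounds; each fold is the validation set once.
    Returns the mean of the k scores."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(train_and_score(train, validation))
    return mean(scores)

# Hypothetical callback: in practice you'd train your model on `train`
# and return its performance metric on `validation`.
avg = cross_validate(100, lambda train, validation: len(train) / 100, k=5)
print(avg)  # 0.8 — each round trains on 80 of the 100 samples
```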

Note that this, too, isn't a 100% guarantee that your model works. I would only use it if you have doubts about the performance of your model and want to make sure that the random factor doesn't work to your disadvantage.

Human in the loop

Ultimately, your machine learning model isn't going to be 100% effective. But by using proper validation techniques you can make sure that the results are as close to 100% as they can be.

There's one additional step that I want to mention here. Please make sure you ask a human what they think of your model. User feedback is the ultimate test of how well your model is doing.

The people that use your solution know what to expect and are almost always better at picking the right label or setting the right output value for a set of inputs.

Conclusion

With the right techniques in your toolbox you should now be able to test your AI solution and make sure that it meets user standards.

So in short, when you want to build an AI solution, follow these steps to measure its performance:

Pick the right validation strategy: Ask yourself, will I be using model selection in my project? If so, use the appropriate data processing to get the right datasets for training, validation and testing.

Pick the right performance metrics: Ask yourself, what kind of model am I testing? Are false positives a bad thing? Or do I want to be as precise as possible?

Ask the people that use the model. User feedback is the most important performance metric of all.