How To Compare Machine Learning Algorithms in Python with scikit-learn

It is important to compare the performance of multiple different machine learning algorithms consistently.

In this post you will discover how you can create a test harness to compare multiple different machine learning algorithms in Python with scikit-learn.

You can use this test harness as a template on your own machine learning problems and add more and different algorithms to compare.

Let’s get started.

Photo by Michael Knight, some rights reserved.

Choose The Best Machine Learning Model

How do you choose the best model for your problem?

When you work on a machine learning project, you often end up with multiple good models to choose from. Each model will have different performance characteristics.

Using resampling methods like cross validation, you can get an estimate for how accurate each model may be on unseen data. You need to be able to use these estimates to choose one or two best models from the suite of models that you have created.

Compare Machine Learning Models Carefully

When you have a new dataset, it is a good idea to visualize the data using different techniques in order to look at the data from different perspectives.

The same idea applies to model selection. You should use a number of different ways of looking at the estimated accuracy of your machine learning algorithms in order to choose the one or two to finalize.

A way to do this is to use different visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies.
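One such view is a box and whisker plot of the per-fold accuracy scores. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` as a stand-in for real data, three arbitrarily chosen models, and a hypothetical output file name.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 8 numeric inputs, 2 classes.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("CART", DecisionTreeClassifier(random_state=7)),
    ("NB", GaussianNB()),
]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Collect the 10 fold accuracies for each model.
names, results = [], []
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    names.append(name)
    results.append(scores)

# One box per model: median, quartiles and outliers of the fold accuracies.
fig, ax = plt.subplots()
ax.boxplot(results)
ax.set_xticks(range(1, len(names) + 1))
ax.set_xticklabels(names)
ax.set_title("Algorithm Comparison")
ax.set_ylabel("Accuracy")
fig.savefig("algorithm_comparison.png")  # hypothetical file name
```

A box that sits higher suggests better typical accuracy, and a shorter box suggests less variation across folds.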

In the next section you will discover exactly how you can do that in Python with scikit-learn.

Compare Machine Learning Algorithms Consistently

The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data.

You can achieve this by forcing each algorithm to be evaluated on a consistent test harness.

In the example below 6 different algorithms are compared:

Logistic Regression

Linear Discriminant Analysis

K-Nearest Neighbors

Classification and Regression Trees

Naive Bayes

Support Vector Machines

The problem is a standard binary classification dataset from the UCI machine learning repository called the Pima Indians onset of diabetes problem. The problem has two classes and eight numeric input variables of varying scales.

The 10-fold cross validation procedure is used to evaluate each algorithm, importantly configured with the same random seed to ensure that the same splits of the training data are performed and that each algorithm is evaluated in precisely the same way.

Each algorithm is given a short name, useful for summarizing results afterward.
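The harness described above could be sketched roughly as follows. This is a minimal sketch, not the original listing: the Pima Indians data is not loaded here, so `make_classification` generates a synthetic stand-in with eight numeric inputs and two classes, and the seed value and `max_iter` setting are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (assumption): 8 inputs, 2 classes.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

# Each algorithm gets a short name for the summary.
models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("CART", DecisionTreeClassifier(random_state=7)),
    ("NB", GaussianNB()),
    ("SVM", SVC()),
]

seed = 7  # the same seed for every model gives every model the same folds
results = {}
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = scores
    # Mean accuracy and standard deviation across the 10 folds.
    print(f"{name}: {scores.mean():.3f} ({scores.std():.3f})")
```

To try another algorithm, append another `(name, estimator)` pair to `models`; the evaluation loop stays unchanged.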

You should not rely on this alone. Accuracy is only one aspect of a model’s performance. Depending on the investigation, you should also look at precision and recall, because accuracy may tell only a small part of the story. Also, why have you not taken an approach with ANOVA or the Wilcoxon test, which are widely accepted tests in data science? Additionally, 5×2 cross-validation should be done, not 10-fold (this is widely accepted). Last, what I find completely missing is any discussion of how to actually arrive at a statistically significant decision. This is not a good representation.

Eric, nobody cares about your phd, whatever it is you did it in. Also, stop calling it ANOVA when all you’re doing is a regression, it doesn’t make you any smarter. And lastly, nobody cares about your phd and your academic research, this is a machine learning article for Data Scientists.

Big fan of your tutorials. I have a question regarding the compare first, then tune approach. When we plot them on a box plot and select the best, this is all based on the default model settings, right? But once we have tuned the different settings in a given model, would the predictive performance be different?

So the not so good models might even outperform the model that looked best in the first-glance boxplot, if we had trained them more properly. So in this sense, wouldn’t it be better to train every model separately, and then, say, put them all together and plot their ROC curves to see which performs best?

AUC can be examined on an ROC or precision vs recall curve. What should happen is that misclassifications are weighted, based on a confusion matrix. In this case, you can tune a model to avoid certain misclassifications, as some may be more valuable to avoid. If you do not care which misclassification occurs, ROC is a decent metric for how you should tune parameters. ROC should be examined for hyperparameter decisions.
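For readers who want to look beyond accuracy, scikit-learn's `cross_validate` can report several metrics over the same folds. The sketch below is illustrative only: the data is a synthetic, mildly imbalanced stand-in, and the choice of logistic regression and of these four metrics is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption), with roughly a 65/35 class split.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=7)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scoring = ["accuracy", "precision", "recall", "roc_auc"]

# One pass over the folds, scoring every metric on each held-out fold.
out = cross_validate(LogisticRegression(max_iter=1000), X, y,
                     cv=cv, scoring=scoring)

for metric in scoring:
    scores = out[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} ({scores.std():.3f})")
```

Because every metric is computed on the same splits, the numbers can be compared side by side for one model, or across models if the same `cv` object is reused.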

To answer my own question, it appears that each model is trained and tested for all folds before moving on to the next model. The seed applies to the initial state so for the above, the 10 folds will all be different from one another, but the same data split for each of the 10 folds will be presented to each algorithm.
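This behaviour can be checked directly. The small sketch below uses a toy array (an assumption, any data would do) and confirms that a seeded `KFold` yields the same test folds each time `split` is called, while the ten folds themselves do not overlap.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # toy data: 50 rows, 2 columns
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Two passes over the splitter, as would happen for two different models.
first_pass = [test_idx for _, test_idx in kfold.split(X)]
second_pass = [test_idx for _, test_idx in kfold.split(X)]

# Every fold in the second pass matches the corresponding fold in the first.
same_each_pass = all(np.array_equal(a, b)
                     for a, b in zip(first_pass, second_pass))

# Within one pass, the test folds cover each row exactly once.
all_test_idx = np.concatenate(first_pass)
folds_are_disjoint = len(np.unique(all_test_idx)) == len(X)

print(same_each_pass, folds_are_disjoint)
```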

Thank you for sharing.
I had to tweak the code a little to make it work with scikit-learn 0.18.
The cross_validation module is deprecated. It’s replaced by model_selection.
The KFold parameters have changed too:
0.17: cross_validation.KFold(n, n_folds=3, shuffle=False, random_state=None)
0.18: model_selection.KFold(n_splits=3, shuffle=False, random_state=None)
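As a small sketch of the newer `model_selection` style: the number of samples is no longer passed to the constructor, and the data is handed to `split` instead. The toy array and the parameter values here are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((20, 3))  # toy data: 20 rows

# n_splits replaces n_folds; no sample count is given up front.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
n_splits = kfold.get_n_splits()

# The data is supplied when the splits are generated.
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(X)]
print(n_splits, fold_sizes)
```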

I have a question: is it ok to train the classifier before adding it to the list? Like:
lr = LogisticRegression()
lr.fit(X_train, y_train)
models.append(('LR', lr))
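One relevant detail is that `cross_val_score` works on a fresh clone of the estimator for each fold, so fitting beforehand should make no difference to the scores. The sketch below illustrates this on synthetic stand-in data (an assumption); the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

unfitted = LogisticRegression(max_iter=1000)
prefitted = LogisticRegression(max_iter=1000).fit(X, y)

scores_unfitted = cross_val_score(unfitted, X, y, cv=kfold)
scores_prefitted = cross_val_score(prefitted, X, y, cv=kfold)

# The earlier fit is ignored: both estimators score the same on every fold.
same_scores = np.allclose(scores_unfitted, scores_prefitted)

# The object passed in is left untouched; only its clones were fitted.
still_unfitted = not hasattr(unfitted, "coef_")
print(same_scores, still_unfitted)
```

So pre-fitting is harmless for the comparison, but it is also wasted work.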

What a great article! I learned so much from your writing 🙂
I also read your other article comparing different algorithms in R, and I noticed that you used a lot more techniques in that article:
• Table Summary
• Box and Whisker Plots
• Density Plots
• Dot Plots
• Parallel Plots
• Scatterplot Matrix
• Pairwise xyPlots
• Statistical Significance Tests
I was wondering why you did not provide the same techniques in this Python article? Is it because these functions are more readily available in R?
Thanks so much!

Hi Jason. Thank you for these great articles. I also read this article of yours (https://goo.gl/v71GPT). What I wonder is the proper validation method. Should we conduct k-fold or repeated n*k-fold cross validation? I recently read a journal article where researchers compare around 50 models under 5*2-folds setting, suggesting it is more robust. How should we proceed while comparing models?

Using k-fold cross validation is a gold standard. The specific configuration is problem specific, but common configurations of 3, 5, and 10 folds do well on many datasets.

On very large datasets, a train-test split may be sufficient. For complex or small datasets, if you have the resources, repeated k-fold cross validation is preferred. Often, we would like to use repeated k-fold cross validation, but the computational expense is too high.

There is no “best”, just lots of options to tune for your given problem.
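Both repeated k-fold and the 5×2 setting mentioned in the question can be expressed with scikit-learn's `RepeatedStratifiedKFold`. This is a sketch on synthetic stand-in data (an assumption), with logistic regression chosen arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)
model = LogisticRegression(max_iter=1000)

schemes = {
    # 10-fold cross validation repeated 3 times: 30 scores.
    "3x10-fold": RepeatedStratifiedKFold(n_splits=10, n_repeats=3,
                                         random_state=7),
    # 2-fold cross validation repeated 5 times (5x2): 10 scores.
    "5x2-fold": RepeatedStratifiedKFold(n_splits=2, n_repeats=5,
                                        random_state=7),
}

results = {}
for label, cv in schemes.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[label] = scores
    print(f"{label}: n={len(scores)} "
          f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

The cost grows with the number of repeats, since the model is refit once per split.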

Thanks a lot for this good article.
Could you please give some interpretations of the standard deviation values?
Especially regarding overfitting.
I thought that in case we have a small standard deviation of the cv results, we will have more overfitting, but I am not sure about that.

So standard deviation summarizes the spread of the distribution, assuming it is Gaussian.

A tight spread may suggest overfitting or it may not, but we can only be sure by evaluating the model on a hold out dataset.

One use of the stdev is to specify a confidence interval for the result. For example, the performance of the model is x% on unseen data, with the performance within 2 standard deviations of that score (roughly a 95% interval, assuming a Gaussian distribution).

Following Othmane’s question, shouldn’t we work with the standard error of the mean instead of the standard deviation? Basically, divide the standard deviations by sqrt(10). This is because “The standard error (SE) of a statistic (most commonly the mean) is the standard deviation of its sampling distribution”. https://en.wikipedia.org/wiki/Standard_error
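The difference between the two quantities can be seen in a few lines of NumPy. The fold scores below are simulated purely for illustration and are not results from any model. Note also that cross validation folds share training data, so the scores are not fully independent and the standard error computed this way is likely optimistic.

```python
import numpy as np

# Simulated fold accuracies (assumption, not real results).
rng = np.random.default_rng(7)
scores = rng.normal(loc=0.77, scale=0.04, size=10)

k = len(scores)
mean = scores.mean()
std = scores.std(ddof=1)   # spread of the individual fold scores
sem = std / np.sqrt(k)     # uncertainty in the mean of the fold scores

print(f"mean accuracy: {mean:.3f}")
print(f"mean +/- 2 std: [{mean - 2 * std:.3f}, {mean + 2 * std:.3f}]")
print(f"mean +/- 2 sem: [{mean - 2 * sem:.3f}, {mean + 2 * sem:.3f}]")
```

The first interval describes where a single fold's score tends to fall; the second, narrower one describes how well the mean itself is pinned down.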

Hi, from the boxplot, we get LR and LDA to have higher accuracy, so we select them as our models.
So now, can I apply train_test_split to check the RMSE and the accuracy on the test data using both of these models, and make whichever gives the best result my final model?

There are many ways to choose a final model. Often we prefer a model with better average performance rather than better absolute performance, this is because of the natural variance in the estimation of performance of the models on unseen data.

Once you choose a final model, train it on all available data and you can start to use it to make predictions.
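Finalizing a chosen model might look like the sketch below. The data is a synthetic stand-in, logistic regression is only an example of a chosen model, and the "new" rows are simply borrowed from the training data for illustration; all three are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for all available data (assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

# Fit the chosen algorithm on every available row; no data is held back.
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X, y)

# Stand-in for rows that arrive later (here borrowed from X).
new_rows = X[:3]
predictions = final_model.predict(new_rows)
print(predictions)
```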

I have started learning and implementing Machine learning algorithms.
One question – the above blog will tell us which machine learning algorithm to go with. However, if we are using regression, should we also check how well the regression fits the data by checking autocorrelation, multicollinearity and normality?

What I have learnt from reading blogs and articles is that we all calculate a score using cross validation, and then find out which model would fit best. I have not seen anyone following traditional approaches such as checking autocorrelation, multicollinearity and normality. I might be wrong. Please throw some light on this.
Thanks, Nitesh

Hi Jason! First of all thanks for all your blog posts, they are really helping me to better understand how to work with datasets and machine learning algorithms.

I’ve a question related to the scoring method. Before discovering the method you are using here, I was using the .score() method in this way (assume I have already split the dataset 80/20 and transformed the data):

Hi,
Thank you so much for this tutorial. It really helps one using Machine Learning in sklearn.
One question: I’m trying to use this code with my dataset, but I have features which are strings and not numbers, as in your dataset.
What can I do to change the code in order for it to work? (I’m getting an error saying “could not convert string to float”.)
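One common way to handle string features is to one-hot encode them inside a pipeline, so the encoding is learned on each training fold. The sketch below is an assumption-laden illustration: the toy dataset, its column meanings and the step names are all made up.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (assumption): column 0 holds strings, column 1 holds numbers.
rng = np.random.default_rng(7)
n = 200
colour = rng.choice(["red", "green", "blue"], size=n)
size = rng.normal(size=n)
X = np.empty((n, 2), dtype=object)
X[:, 0] = colour
X[:, 1] = size
y = ((colour == "red") | (size > 0.5)).astype(int)

# Encode the string column, scale the numeric column.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), [0]),
    ("num", StandardScaler(), [1]),
])
pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipeline, X, y, cv=kfold, scoring="accuracy")
print(f"{scores.mean():.3f} ({scores.std():.3f})")
```

The pipeline can be placed in the list of models like any other estimator, so the rest of the harness does not need to change.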

Hello, Dr. Jason! Thank you so much for this wonderful article. I have a question for you. I noticed you have not mentioned feature selection and feature engineering in the Python mini course. So, my question is that if we were to implement both of these tasks, what should be the order with respect to this present stage of spot checking and comparison of machine learning algorithms? Should we first select one or two best performing models after comparison and then implement feature selection and feature engineering or first implement them and then perform spot checking and model comparison?
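Whatever order is chosen, one way to keep the comparison fair is to put feature selection inside a pipeline, so it is refit on the training portion of each fold rather than on the full dataset. The sketch below assumes synthetic stand-in data, `SelectKBest` with an arbitrary `k=4`, and logistic regression as the example model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

# Selection and model are fitted together on each training fold only.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=4)),
    ("model", LogisticRegression(max_iter=1000)),
])

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipeline, X, y, cv=kfold, scoring="accuracy")
print(f"{scores.mean():.3f} ({scores.std():.3f})")
```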

Thank you for answering! But, that “generally” is really not helping. Can you please explain in which cases it is suggested to do that and in which cases not? I think it’s really important for me to learn.