Who Is This Mini-Course For?

The list below provides some general guidelines as to who this course was designed for.

Don’t panic if you don’t match these points exactly, you might just need to brush up in one area or another to keep up.

Developers that know how to write a little code. This means that it is not a big deal for you to pick up a new programming language like Python once you know the basic syntax. It does not mean you’re a wizard coder, just that you can follow a basic C-like language with little effort.

Developers that know a little machine learning. This means you know the basics of machine learning like cross-validation, some algorithms and the bias-variance trade-off. It does not mean that you are a machine learning Ph.D., just that you know the landmarks or know where to look them up.

This mini-course is neither a textbook on Python or a textbook on machine learning.

It will take you from a developer that knows a little machine learning to a developer who can get results using the Python ecosystem, the rising platform for professional machine learning.

Mini-Course Overview

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hard core!). It really depends on the time you have available and your level of enthusiasm.

Below are 14 lessons that will get you started and productive with machine learning in Python:

Lesson 1: Download and Install Python and SciPy ecosystem.

Lesson 2: Get Around In Python, NumPy, Matplotlib and Pandas.

Lesson 3: Load Data From CSV.

Lesson 4: Understand Data with Descriptive Statistics.

Lesson 5: Understand Data with Visualization.

Lesson 6: Prepare For Modeling by Pre-Processing Data.

Lesson 7: Algorithm Evaluation With Resampling Methods.

Lesson 8: Algorithm Evaluation Metrics.

Lesson 9: Spot-Check Algorithms.

Lesson 10: Model Comparison and Selection.

Lesson 11: Improve Accuracy with Algorithm Tuning.

Lesson 12: Improve Accuracy with Ensemble Predictions.

Lesson 13: Finalize And Save Your Model.

Lesson 14: Hello World End-to-End Project.

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the Python platform (hint, I have all of the answers directly on this blog, use the search feature).

I do provide more help in the early lessons because I want you to build up some confidence and inertia.

Hang in there, don’t give up!

Lesson 1: Download and Install Python and SciPy

You cannot get started with machine learning in Python until you have access to the platform.

Today’s lesson is easy, you must download and install the Python 3.6 platform on your computer.

Visit the Python homepage and download Python for your operating system (Linux, OS X or Windows). Install Python on your computer. You may need to use a platform specific package manager such as macports on OS X or yum on RedHat Linux.

You also need to install the SciPy platform and the scikit-learn library. I recommend using the same approach that you used to install Python.

You can install everything at once (much easier) with Anaconda. Recommended for beginners.

Start Python for the first time by typing “python” at the command line.

Check the versions of everything you are going to need using the code below:

Lesson 2: Get Around In Python, NumPy, Matplotlib and Pandas.

As a developer, you can pick-up new programming languages pretty quickly. Python is case sensitive, uses hash (#) for comments and uses whitespace to indicate code blocks (whitespace matters).

Today’s task is to practice the basic syntax of the Python programming language and important SciPy data structures in the Python interactive environment.

Practice assignment, working with lists and flow control in Python.

Practice working with NumPy arrays.

Practice creating simple plots in Matplotlib.

Practice working with Pandas Series and DataFrames.

For example, below is a simple example of creating a Pandas DataFrame.

1

2

3

4

5

6

7

8

# dataframe

import numpy

import pandas

myarray=numpy.array([[1,2,3],[4,5,6]])

rownames=['a','b']

colnames=['one','two','three']

mydataframe=pandas.DataFrame(myarray,index=rownames,columns=colnames)

print(mydataframe)

Lesson 3: Load Data From CSV

Machine learning algorithms need data. You can load your own data from CSV files but when you are getting started with machine learning in Python you should practice on standard machine learning datasets.

Your task for today’s lesson is to get comfortable loading data into Python and to find and load standard machine learning datasets.

Lesson 6: Prepare For Modeling by Pre-Processing Data

Your raw data may not be setup to be in the best shape for modeling.

Sometimes you need to preprocess your data in order to best present the inherent structure of the problem in your data to the modeling algorithms. In today’s lesson, you will use the pre-processing capabilities provided by the scikit-learn.

The scikit-learn library provides two standard idioms for transforming data. Each transform is useful in different circumstances: Fit and Multiple Transform and Combined Fit-And-Transform.

There are many techniques that you can use to prepare your data for modeling. For example, try out some of the following

Standardize numerical data (e.g. mean of 0 and standard deviation of 1) using the scale and center options.

Normalize numerical data (e.g. to a range of 0-1) using the range option.

Explore more advanced feature engineering such as Binarizing.

For example, the snippet below loads the Pima Indians onset of diabetes dataset, calculates the parameters needed to standardize the data, then creates a standardized copy of the input data.

Lesson 7: Algorithm Evaluation With Resampling Methods

The dataset used to train a machine learning algorithm is called a training dataset. The dataset used to train an algorithm cannot be used to give you reliable estimates of the accuracy of the model on new data. This is a big problem because the whole idea of creating the model is to make predictions on new data.

You can use statistical methods called resampling methods to split your training dataset up into subsets, some are used to train the model and others are held back and used to estimate the accuracy of the model on unseen data.

Your goal with today’s lesson is to practice using the different resampling methods available in scikit-learn, for example:

Split a dataset into training and test sets.

Estimate the accuracy of an algorithm using k-fold cross validation.

Estimate the accuracy of an algorithm using leave one out cross validation.

The snippet below uses scikit-learn to estimate the accuracy of the Logistic Regression algorithm on the Pima Indians onset of diabetes dataset using 10-fold cross validation.

Lesson 8: Algorithm Evaluation Metrics

There are many different metrics that you can use to evaluate the skill of a machine learning algorithm on a dataset.

You can specify the metric used for your test harness in scikit-learn via the cross_validation.cross_val_score() function and defaults can be used for regression and classification problems. Your goal with today’s lesson is to practice using the different algorithm performance metrics available in the scikit-learn package.

Practice using the Accuracy and LogLoss metrics on a classification problem.

Practice generating a confusion matrix and a classification report.

Practice using RMSE and RSquared metrics on a regression problem.

The snippet below demonstrates calculating the LogLoss metric on the Pima Indians onset of diabetes dataset.

Lesson 9: Spot-Check Algorithms

You cannot possibly know which algorithm will perform best on your data beforehand.

You have to discover it using a process of trial and error. I call this spot-checking algorithms. The scikit-learn library provides an interface to many machine learning algorithms and tools to compare the estimated accuracy of those algorithms.

In this lesson, you must practice spot checking different machine learning algorithms.

Which parameters achieved the best results? Can you do better? Let me know in the comments.

Lesson 12: Improve Accuracy with Ensemble Predictions

Another way that you can improve the performance of your models is to combine the predictions from multiple models.

Some models provide this capability built-in such as random forest for bagging and stochastic gradient boosting for boosting. Another type of ensembling called voting can be used to combine the predictions from multiple different models together.

106 Responses to Python Machine Learning Mini-Course

Thank you so much for these lessons. It got me started on Python very quickly. I’m currently in lesson 7. I tried LeaveOneOut resampling method and got: Accuracy: 76.823% (42.196%).
I have a question: how do we consider a model better than others: it seems accuracy is not enough, we should also consider lower variance in the cross validation, shouldn’t we?

Also, I have some issues on the LDA (error: Unknown label type). I read that it is a classifier, but since our output variable is not a class, we cannot use LDA in case of Boston Houses Prices dataset, am I right?

First of all, thank you for your quick replies! As I progress towards the last lessons, I have always this question: what happens if we don’t specify a value for ‘scoring’ in cross_val_score? The documentation says default value is ‘None’, but then how can we interpret the value of results.meant() if no scoring is specified?

Hi Jason. Thanks for ALL you do. I was doing “Lesson 7: Algorithm Evaluation With Resampling Methods”, when I ran into the following challenges running Python 35 with sklearn VERSION 0.18. :

c:\python35\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
“This module will be removed in 0.20.”, DeprecationWarning)

Regarding “Lesson 9: Spot-Check Algorithms”, I would like to know how can I use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, AND which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE?

I have published a post on my blog titled “Naive Spot-Check of AI Algorithms” which references your work. The post generates 36 Spot-Check Cases, using (3 Datasets x 4 Models(Algorithms) x 3 Scorings). There were 11 out of 36 Cases that returned numerical results. The other 25 Cases returned Errors or Warnings.

Again, I would like to know how can I use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, AND which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE?

Spot checking is to discover which algorithms look good on one given dataset. Not across datasets.

You may need to group algorithms by their expectations then prepare data for each group.

Most machine learning algorithms expect data to have numeric input values and an integer encoded or one hot encoded output value for classification. This is a good normalized view of a dataset to construct.

Thanks Jason! I did that and it worked. I’m actually using Anaconda so that I don’t have to install packages individually, but since it is using the latest Python (3.5.2) and you used a previous version, it isn’t running as smoothly.

This time I’m running into an issue of an unsupported operand (TypeError: unsupported operand type(s) for %: ‘NoneType’ and ‘tuple’) while trying to print accuracy results similar to Joe Dorocak, and his solution didn’t work for me. I’ll fiddle with it some more and hopefully I’ll find a fix.

Nevertheless, I got following for accuracy w/o formatting:
(76.951469583048521, 4.8410519245671946)

Third, I am implementing my own score function in order to compute multiple scoring metrics at the same time. It works well with all of them except for the log loss.I obtain high values (around 7.8) Here is the code:

Hi Jason, I have run through most of the Lesson in this posts and I have to say thank you for that. It has been a while since I’ve been wanting to dig in more into ML and your blog will definitely be of help from now on.
My results are:

I used the rescaled and standardised matrix X for all of my analysis.
My question is: How will I know if rescaling is actually working? Is that given by the context? I suppose in your code you calculated your statistics using the data as raw and unprocessed as possible…

Really great course! As someone just getting into machine learning, but knows how to code, this is the perfect level for me.

I had a quick question. I’m going through the Iris dataset, and spot-checking different algorithms the way you demonstrated (results = cross_val_score(model, rescaledX, Y, cv=kfold)), and one of the algorithms I’m checking is the Ridge algorithm.

Looking at the scores it returns:
Ridge Results: [ 0. 0. 0. 0.753 0. 0. 0.848 0. 0. 0. ], it seems to perform alright sometimes, then get 0 other times. How come there is so much variation in accuracy between the testing results?

I also found that for KFold, if I use ‘n_split = 9’, I can get better accuracy than the other values like ‘n_split = 10’ or ‘n_split = 8’ without other optimizations.(I mean I only changed the value of parameter ‘n_split’.)

So, here is the question: how can I save the model with the highest accuracy found during k-Fold cross validation? (that means I wanna save the model found when “n_split = 9” for later production use)
Because in my understanding, cross validation contains two functionatilities: training the model and evaluate the model.

Great site and great way to get started. Enjoying going through the mini-course!

I have a question on Lesson #9
KNN works as coded in your example with an Accuracy of -88 with my kfold parameters

Can I use LogisticRegression on this along with Accuracy scoring? When I tried using LogisticRegression Model on the Boston Housing Data sample in Lesson #9, I get a bunch of errors – ValueError: Unknown label type: ‘continuous’

I was trying to tune the parameters using a Random Search. So instead of using GridSearchCV I switched to RandomizedSearchCV, but am having difficulty setting the model (tried using Ridge as in the GridSearchCV example) and also the distribution parameters to try and tune the parameters for RandomizedSearchCV.

How should I go about setting the model and the param_grid for the RandomizedSearchCV?

Is there any difference between these two ways (of Decision Trees)? I think it will give same accuracy. But it gave different results when I used these two ways with same dataset.
(1)
num_instances = len(X)
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = DecisionTreeClassifier()
results = model_selection.cross_val_score(model,X,Y,cv=kfold)
print(“Accuracy: %.3f%% (%.3f%%)” %(results.mean()*100.0,results.std()*100.0))