Classification in Python with Scikit-Learn and Pandas

Introduction

Classification is a large domain in the field of statistics and machine learning. Generally, classification can be broken down into two areas:

Binary classification, where we wish to group an outcome into one of two groups.

Multi-class classification, where we wish to group an outcome into one of multiple (more than two) groups.

In this post, the main focus will be on using a variety of classification algorithms across both of these domains; less emphasis will be placed on the theory behind them.

We can use libraries in Python such as scikit-learn for machine learning models, and Pandas to import data as data frames.

These can easily be installed and imported into Python with pip:

$ python3 -m pip install scikit-learn
$ python3 -m pip install pandas

import sklearn as sk
import pandas as pd

Binary Classification

For binary classification, we are interested in classifying data into one of two binary groups - these are usually represented as 0's and 1's in our data.

We will look at data regarding coronary heart disease (CHD) in South Africa. The goal is to use variables such as tobacco usage, family history, LDL cholesterol levels, alcohol usage, and obesity to predict whether or not a person has CHD.
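To give a sense of how this data can be handled with pandas, here is a sketch that separates the predictor variables from the binary chd response. The column names below mirror the South African CHD dataset, but the values are a small synthetic stand-in; in practice you would load the real file, for example with pd.read_csv():

```python
import pandas as pd

# Small synthetic stand-in for the South African CHD data;
# in practice, load the real dataset (e.g. with pd.read_csv()).
data = pd.DataFrame({
    "tobacco": [0.0, 4.1, 0.5, 7.6, 2.0, 0.0],
    "ldl":     [3.5, 6.4, 3.8, 5.7, 4.2, 2.9],
    "famhist": [0, 1, 0, 1, 1, 0],   # 1 = family history of CHD
    "alcohol": [5.3, 24.0, 2.1, 46.6, 14.1, 0.0],
    "obesity": [23.1, 28.9, 22.0, 31.2, 26.3, 20.8],
    "chd":     [0, 1, 0, 1, 1, 0],   # binary response: 1 = has CHD
})

X = data.drop(columns="chd")  # predictor variables
y = data["chd"]               # binary outcome (0 or 1)
```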

Support Vector Machines

Support Vector Machines (SVMs) are a flexible type of classification algorithm: they can perform linear classification, but can also use non-linear basis functions (kernels). The following example uses a linear classifier to fit a hyperplane that separates the data into two classes:
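A minimal sketch of that linear classifier is shown below. Synthetic data from make_classification stands in for the CHD features here, since the real dataset isn't bundled with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data as a stand-in for the CHD features
X, y = make_classification(n_samples=200, n_features=5,
                           n_informative=3, random_state=0)

# kernel="linear" fits a separating hyperplane; other kernels
# (e.g. the default "rbf") give non-linear decision boundaries
clf = SVC(kernel="linear")
clf.fit(X, y)

preds = clf.predict(X[:5])    # class predictions for the first 5 rows
accuracy = clf.score(X, y)    # mean accuracy on the fitted data
```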

Random Forests

Random Forests are an ensemble learning method that fit multiple Decision Trees on subsets of the data and average the results. We can again fit them using sklearn, and use them to predict outcomes, as well as get mean prediction accuracy:
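A sketch of that workflow, again on synthetic stand-in data, looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data as a stand-in for the real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 100 trees, each fit on a bootstrap sample of the rows;
# the forest's prediction is a majority vote across trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

preds = forest.predict(X)      # predicted classes
accuracy = forest.score(X, y)  # mean prediction accuracy
```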

Neural Networks

Neural Networks are machine learning models built from layers of interconnected neurons, where each connection carries a weight and each neuron applies an activation function to its inputs. These essentially use a very simplified model of the brain to model and predict data.

We use sklearn for consistency in this post; however, libraries such as TensorFlow and Keras are better suited to fitting and customizing neural networks, which come in several varieties designed for different purposes:
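In sklearn, a basic feed-forward network is available as MLPClassifier. The sketch below fits one on synthetic stand-in data; the layer sizes and iteration limit are illustrative choices, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data as a stand-in for the real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Two hidden layers of 10 neurons each; max_iter is raised so the
# optimizer has enough iterations to converge on this small dataset
net = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000,
                    random_state=0)
net.fit(X, y)

accuracy = net.score(X, y)  # mean accuracy on the fitted data
```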

Multi-Class Classification

While binary classification alone is incredibly useful, there are times when we would like to model and predict data that has more than two classes. Many of the same algorithms can be used with slight modifications.
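For instance, sklearn's LogisticRegression handles more than two classes out of the box. The sketch below uses the bundled wine dataset (three classes) as a stand-in for the multi-class data analysed in this post:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

# The wine dataset has three classes, used here as a stand-in
# for the multi-class data discussed in the post
X, y = load_wine(return_X_y=True)

# LogisticRegression extends to multiple classes with no code changes
clf = LogisticRegression(max_iter=5000)
clf.fit(X, y)

accuracy = clf.score(X, y)  # mean accuracy across all three classes
```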

Additionally, it is common to split data into training and test sets. This means we use a certain portion of the data to fit the model (the training set) and save the remaining portion of it to evaluate the predictive accuracy of the fitted model (the test set).

There's no official rule to follow when deciding on a split proportion, though in most cases you'd want about 70% of the data for the training set and around 30% for the test set.
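That split is a one-liner with sklearn's train_test_split; the sketch below holds out 30% of synthetic stand-in data for evaluation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data as a stand-in for the real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 30% of the rows for evaluation; the remaining 70%
# are used to fit the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)  # (140, 5) (60, 5)
```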

Although the implementations of these models were rather naive (in practice there are a variety of parameters that can and should be varied for each model), we can still compare the predictive accuracy across the models. This will tell us which one is the most accurate for this specific training and test dataset:

Model                     Predictive Accuracy
Logistic Regression       46.10%
Support Vector Machine    64.07%
Random Forest             57.58%
Neural Network            54.55%

This shows us that for the vowel data, an SVM using the default radial basis function was the most accurate.

Conclusion

To summarize this post, we began by exploring the simplest form of classification: binary. This helped us to model data where our response could take one of two states.

We then moved further into multi-class classification, when the response variable can take any number of states.

We also saw how to fit and evaluate models with training and test sets. From here, a natural next step would be tuning each model's parameters to further refine its fit.