Introduction

Welcome to the fifth and final chapter in a five-part series about machine learning.

In this final chapter, we will revisit unsupervised learning in greater depth, briefly discuss other fields related to machine learning, and finish the series with some examples of real-world machine learning applications.

Unsupervised Learning

Recall that unsupervised learning involves learning from data, but without the goal of prediction. This is because the data either comes without a target response variable (label), or one chooses not to designate a response. Unsupervised learning can also be used as a pre-processing step for supervised learning.

In the unsupervised case, the goal is to discover patterns and insights in the data, understand its variation, find unknown subgroups (amongst the variables or observations), and so on. Unsupervised learning can be quite subjective compared to supervised learning.

The two most commonly used techniques in unsupervised learning are principal component analysis (PCA) and clustering. PCA is one approach to learning what is called a latent variable model, and is a particular version of a blind signal separation technique. Other notable latent variable modeling approaches include the expectation-maximization (EM) algorithm and the method of moments 3.

PCA

PCA produces a low-dimensional representation of a dataset by finding a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. Another way to describe PCA is that it is a transformation of possibly correlated variables into a set of linearly uncorrelated variables known as principal components.

Each of the components is mathematically determined and ordered by the amount of variance that it is able to explain from the data. The first principal component accounts for the largest amount of variance, the second principal component the next largest, and so on.

Each component is also orthogonal to all others, which is just a fancy way of saying that they’re perpendicular to each other. Think of the X and Y axes in a two-dimensional plot. The two axes are perpendicular to each other, and are therefore orthogonal. While not easy to visualize, think of having many principal components as being many axes that are all perpendicular to each other.
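To make the ideas of ordered variance and orthogonal components concrete, here is a minimal sketch of PCA for two-dimensional data. It is an illustration only: it relies on the closed-form eigen-decomposition of a 2x2 covariance matrix, which does not generalize to more dimensions, and the function name `pca_2d` and the sample points are made up for this example.

```python
import math

def pca_2d(points):
    """Return [(variance, direction), ...] for 2-D points, largest variance first."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the centered data.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix: the variance
    # captured along each principal direction.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    var_first, var_second = half_trace + spread, half_trace - spread
    # Angle of the direction of largest variance; the second direction
    # is the first one rotated by 90 degrees.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))
    return [(var_first, first), (var_second, second)]

# Illustrative data: two strongly correlated variables.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
components = pca_2d(points)
for variance, direction in components:
    print(f"variance={variance:.4f} direction=({direction[0]:.3f}, {direction[1]:.3f})")
```

For this sample, the first component captures roughly 96% of the total variance, and the dot product of the two directions is zero, which is what orthogonal means in practice.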

While much of the above description of principal component analysis may sound a bit technical, it is actually a relatively simple concept from a high level. Think of having a bunch of data in any number of dimensions, although you may want to picture two or three dimensions for ease of understanding.

Each principal component can be thought of as an axis of an ellipse that is being built (think cloud) to contain the data (aka fit to the data), like a net catching butterflies. The first few principal components should be able to explain (capture) most of the variance in the data, with the addition of more principal components eventually leading to diminishing returns.

One of the tricks of PCA is knowing how many components are needed to summarize the data, which involves estimating when most of the variance is explained by a given number of components. Another consideration is that PCA is sensitive to feature scaling, which was discussed earlier in this series.
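One common heuristic for picking the number of components is to keep adding components until their cumulative share of the variance passes a chosen threshold. The sketch below assumes the per-component variances have already been computed; the function name `components_needed`, the threshold, and the example variances are all hypothetical.

```python
def components_needed(variances, threshold=0.95):
    """Smallest number of leading components whose cumulative
    share of the total variance reaches the threshold."""
    total = sum(variances)
    ordered = sorted(variances, reverse=True)  # largest variance first
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return count
    return len(ordered)

# Hypothetical variances explained by five principal components.
example_variances = [4.0, 3.0, 2.0, 0.5, 0.5]
k = components_needed(example_variances, threshold=0.85)
print(k)
```

The threshold itself is a judgment call, which is part of why choosing the number of components is described as a trick rather than a formula.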

PCA is also used for exploratory data analysis and data visualization. Exploratory data analysis involves summarizing a dataset through specific types of analysis, including data visualization, and is often an initial step in analytics that leads to predictive modeling, data mining, and so on.

Further discussion of PCA and similar techniques is out of scope of this series, but the reader is encouraged to refer to external sources for more information.

Clustering

Clustering refers to a set of techniques and algorithms used to find clusters (subgroups) in a dataset, and involves partitioning the data into groups of similar observations. The concept of ‘similar observations’ is a bit relative and subjective, but it essentially means that the data points in a given group are more similar to each other than they are to data points in a different group.

How to measure similarity between observations is a domain-specific problem and must be addressed accordingly. A clustering example involving the NFL’s Chicago Bears (go Bears!) was given in chapter 1 of this series.

Clustering is not a technique limited only to machine learning. It is a widely used technique in data mining, statistical analysis, pattern recognition, image analysis, and so on. Given the subjective and unsupervised nature of clustering, often data preprocessing, model/algorithm selection, and model tuning are the best tools to use to achieve the desired results and/or solution to a problem.

There are many types of clustering algorithms and models, each of which uses its own technique for dividing the data into a certain number of groups of similar data. Because these approaches differ significantly, the choice of algorithm can strongly affect the results, and therefore one must understand the different algorithms to some extent in order to choose the most applicable one.

K-means and hierarchical clustering are two widely used unsupervised clustering techniques. The difference is that for k-means, a predetermined number of clusters (k) is used to partition the observations, whereas the number of clusters in hierarchical clustering is not known in advance.

Hierarchical clustering helps address the potential disadvantage of having to know or pre-determine k in the case of k-means. There are two primary types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down).
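As a rough illustration of the bottom-up approach, the sketch below starts with every observation in its own cluster and repeatedly merges the two closest clusters. It assumes Euclidean distance and single linkage (cluster distance equals the distance between the closest pair of members), which is only one of several possible choices; the function name `agglomerative` and the sample points are made up for this example.

```python
import math

def agglomerative(points, num_clusters=1):
    """Merge the two closest clusters (single linkage) until only
    num_clusters remain. Returns the clusters as lists of point
    indices, plus the history of merges that were performed."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        distance, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], distance))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
three_clusters, _ = agglomerative(points, num_clusters=3)
two_clusters, _ = agglomerative(points, num_clusters=2)
_, full_history = agglomerative(points)  # merge all the way down to one cluster
print(three_clusters, two_clusters, len(full_history))
```

Because the full merge history is recorded, the number of clusters can be chosen after the fact by deciding where to stop merging, rather than being fixed up front as with k-means.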

Here is a visualization, courtesy of Wikipedia, of the results of running the k-means clustering algorithm on a set of data with k equal to three. Note the lines, which represent the boundaries between the groups of data.
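For readers who want to see the mechanics, here is a minimal sketch of the standard k-means iteration (often called Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing. To keep the example reproducible, it seeds the centroids with evenly spaced points from the input, which is a simplifying assumption; practical implementations typically use random or smarter initialization. The function name `k_means` and the sample data are made up for this example.

```python
import math

def k_means(points, k, max_iterations=100):
    """Partition points into k clusters. Returns (labels, centroids)."""
    n = len(points)
    # Simplifying assumption: use k evenly spaced input points as the
    # starting centroids so the result is deterministic.
    centroids = [points[i * n // k] for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
          (1.0, 9.0), (1.5, 8.5), (2.0, 9.5)]
labels, centroids = k_means(points, k=3)
print(labels)
print(centroids)
```

Note that k is an input here: the algorithm never questions whether three is the right number of groups, it simply finds the best three-way split it can from its starting point.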

There are two types of clustering, which define the degree of grouping or containment of data. The first is called hard clustering, where every data point belongs to only one cluster and not the others. Soft clustering, or fuzzy clustering, on the other hand refers to the case where a data point belongs to a cluster to a certain degree, or is assigned a likelihood (probability) of belonging to a certain cluster.
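The difference between hard and soft assignment can be sketched for a single point and a fixed set of cluster centers. The soft version below uses the membership formula from fuzzy c-means, where closer centers receive a larger share and all shares sum to one; this is just one possible way to define soft membership, and the function names, the `fuzziness` parameter name, and the sample values are assumptions made for this example.

```python
import math

def hard_assignment(point, centroids):
    """Index of the single nearest centroid."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def soft_memberships(point, centroids, fuzziness=2.0):
    """Degree of membership in each cluster (fuzzy c-means style).
    fuzziness must be greater than 1; larger values blur the memberships."""
    distances = [math.dist(point, c) for c in centroids]
    if 0.0 in distances:
        # A point sitting exactly on a centroid belongs fully to it.
        zeros = distances.count(0.0)
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (fuzziness - 1.0)
    weights = [d ** -exponent for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

centroids = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
hard = hard_assignment(point, centroids)
soft = soft_memberships(point, centroids)
print(hard, soft)
```

Here the point sits at distance 1 from the first center and 3 from the second, so the hard assignment picks the first cluster outright, while the soft memberships come out to about 0.9 and 0.1.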

Method comparison and general considerations

What is the difference then between PCA and clustering? As mentioned, PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance, while clustering looks for homogeneous subgroups among the observations.

An interesting point to note is that in the absence of a target response, there is no ground truth against which to evaluate solution performance or errors as one does in the supervised case. In other words, there is no fully objective way to determine whether you’ve found the right solution. This is a significant differentiator between supervised and unsupervised learning methods.

The term machine learning is often used interchangeably with terms like predictive analytics, artificial intelligence, data mining, and so on. While machine learning is certainly related to these fields, there are some notable differences.

Predictive analytics is a subcategory of the broader field of analytics. Analytics is usually broken into three sub-categories: descriptive, predictive, and prescriptive.

Artificial intelligence (AI) is a super exciting field, and machine learning is essentially a sub-field of AI due to the automated nature of the learning algorithms involved. According to Wikipedia, AI has been defined as the science and engineering of making intelligent machines, but also as the study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.

Statistical learning is becoming popularized due to Stanford’s related online course and its associated books: An Introduction to Statistical Learning, and The Elements of Statistical Learning.

Machine learning arose as a subfield of artificial intelligence, while statistical learning arose as a subfield of statistics. The two fields are very similar and overlap in many ways, and the distinction is becoming less clear over time. They differ in that machine learning has a greater emphasis on prediction accuracy and large-scale applications, whereas statistical learning emphasizes models and their related interpretability, precision, and uncertainty.

Lastly, data mining is a field that’s also often confused with machine learning. Data mining leverages machine learning algorithms and techniques, but also spans many other fields such as data science, AI, statistics, and so on.

The overall goal of the data mining process is to extract patterns and knowledge from a data set, and transform it into an understandable structure for further use. Data mining often deals with large amounts of data, or big data.

Machine Learning in Practice

As discussed throughout this series, machine learning can be used to create predictive models, assign classifications, make recommendations, and find patterns and insights in an unlabeled dataset. All of these tasks can be done without requiring explicit programming.

Machine learning has been successfully used in the following non-exhaustive example applications1:

Spam filtering

Optical character recognition (OCR)

Search engines

Computer vision

Recommendation engines, such as those used by Netflix and Amazon

Classifying DNA sequences

Detecting fraud, e.g., credit card and internet

Medical diagnosis

Natural language processing

Speech and handwriting recognition

Economics and finance

Virtually anything else you can think of that involves data

In order to apply machine learning to solve a given problem, the following steps (or a variation) should be taken, drawing on the machine learning elements discussed throughout this series.

Define the problem to be solved and the project’s objective. Ask lots of questions along the way!

Determine the type of problem and type of solution required.

Collect and prepare the data.

Create, validate, tune, test, assess, and improve your model and/or solution. This process should be driven by a combination of technical (stats, math, programming), domain, and business expertise.

Discover any other insights and patterns as applicable.

Deploy your solution for real-world use.

Report on and/or present results.

If you encounter a situation where you or your company can benefit from a machine learning-based solution, simply approach it using these steps and see what you come up with. You may very well wind up with a super powerful and scalable solution!

Summary

Congratulations to those who have read all five chapters in full! I would like to thank you very much for spending your precious time joining me on this machine learning adventure.

This series took me a significant amount of time to write, so I hope that this time has been translated into something useful for as many people as possible.

At this point, we have covered virtually all major aspects of the entire machine learning process at a high level, and at times even went a little deeper.

If you were able to understand and retain the content in this series, then you should have absolutely no problem participating in any conversation involving machine learning and its applications. You may even have some very good opinions and suggestions about different applications, methods, and so on.

Despite all of the information covered in this series, and the details that were out of scope, machine learning and its related fields are also somewhat of an art in practice. There are many decisions that need to be made along the way, customized techniques to employ, and creative strategies to use in order to best solve a given problem.

A high-quality practitioner should also have strong business acumen and expert-level domain knowledge. Problems involving machine learning are just as much about asking questions as they are about finding solutions. If the question is wrong, then the solution will be as well.