Occam’s razor and machine learning

In the last instalment of this blog series, we discussed objectives and accuracy in machine learning. And we described two crucial tests for the utility of a machine learning model: The model must be sufficiently accurate and we must be able to deploy the model so that it can produce actionable outputs from the available data. We then introduced a real-world scenario — predicting train failures up to 36 hours in advance of their occurrence using sensor data — to illustrate the application of those tests.

But how did we decide which of the multitude of machine learning algorithms to use to train our model in the first place? To answer this question, we need to revisit the main classes of machine learning algorithms.

As we explained in the second instalment of this blog series, machine learning algorithms mainly fall into two categories: supervised learning algorithms and unsupervised learning algorithms (for simplicity, we will ignore additional categories like semi-supervised learning and reinforcement learning). There are many algorithms available in each of these categories that can be used for either prediction/classification (in the case of supervised learning) or clustering/segmentation (in the case of unsupervised learning).

With supervised learning, labeled data from the past is used to train a model that can then be used to predict future, similar events. If the label is a continuous variable (e.g., the revenue of a certain product or the number of products sold), algorithms like linear regression, regression trees, random forests or neural networks can be used. If the label is a categorical variable (e.g., true or false), techniques like logistic regression, naïve Bayes classifiers, decision trees or the k-nearest neighbour algorithm are useful.

Unsupervised learning, on the other hand, operates on unlabeled data. Typically, we use unsupervised methods to identify structures and patterns in data that we didn’t know existed before — a process that is often termed “discovery analytics”. If the input data is numerical, the most common set of techniques is cluster analysis. If we are looking at categorical inputs, algorithms like association or affinity analysis can be used, for example, to discover which products are frequently bought in combination with one another in the course of different shopping missions.

Choose your machine learning algorithm

But how do we decide which of these algorithms is most useful for a given problem? The answer is to start with the problem, or rather with the business question that we want to answer through the application of machine learning. What is it that we want to achieve? And how will we measure the success, or otherwise, of our analysis? Accuracy may be (and often is) one of the important success criteria, but several others often matter too. Are the modeling results stable over time (“robustness”)? How long does it take to build and test the model (“speed”)? Can the model handle growing data volumes (“scalability”)? Is the model using as few parameters as possible (“simplicity”)? How easy are the model results and patterns to understand (“interpretability”)? And so on.

Considering these criteria — and understanding which of them are most important in a particular situation — can go a long way towards helping you select the “best” algorithm with which to go after a particular problem.

For example, if we compare a decision tree with an artificial neural network (ANN) for a particular domain using these criteria, we might identify trade-offs like the following.

The ANN may give us better predictive accuracy compared with the decision tree, but the decision tree scores better for simplicity and interpretability. Depending on the business problem and the context, then, we may decide to use a tree — even though the predictions likely won’t be as accurate — if presenting our findings and helping a business stakeholder to understand the relationships found in the data are key considerations for that particular analysis.

The last two success criteria, simplicity and interpretability, are especially interesting to consider, as they are often in tension with one of the subjects of our previous blog — accuracy. And this finally brings us to the title of this blog: Occam’s razor and machine learning.

William of Ockham, a Franciscan friar who studied logic in the 14th century, gave his name to the principle sometimes called lex parsimoniae, which is Latin for “the law of parsimony”. William of Ockham supposedly wrote it in Latin as: “Entia non sunt multiplicanda praeter necessitatem”, which roughly translates as “More things should not be used than are necessary”.

Applied in the context of machine learning, this means that if two algorithms have broadly similar performance for the criteria identified as the most important for a particular project — accuracy and stability, say — we should always prefer the “simpler” one.

But what does “simpler” mean in this context? We think that simpler should generally be taken to mean the algorithm that is least complex to deploy (because, for example, it uses fewer variables that have required less feature engineering to create) and that is easiest to interpret.
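One way to make this rule concrete is sketched below in Python. It is only an illustration under stated assumptions: the names Candidate and pick_simplest, the one-percentage-point tolerance, the interpretability ranking and all of the figures are hypothetical and are not taken from any real project. Complexity is approximated here by the number of input variables, with interpretability as the tie-breaker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float        # hold-out accuracy, between 0 and 1
    n_variables: int       # number of input variables the model uses
    interpretability: int  # hypothetical rank: higher = easier to explain


def pick_simplest(candidates, tolerance=0.01):
    """Return the simplest candidate whose accuracy is within
    `tolerance` of the best accuracy observed."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = max(c.accuracy for c in candidates)
    close_enough = [c for c in candidates if best - c.accuracy <= tolerance]
    # Fewer variables first; among equals, prefer higher interpretability.
    return min(close_enough, key=lambda c: (c.n_variables, -c.interpretability))


# Illustrative (made-up) figures only.
candidates = [
    Candidate("neural network", accuracy=0.92, n_variables=40, interpretability=1),
    Candidate("decision tree", accuracy=0.915, n_variables=8, interpretability=3),
    Candidate("single rule", accuracy=0.80, n_variables=1, interpretability=3),
]
chosen = pick_simplest(candidates, tolerance=0.01)
print(chosen.name)  # decision tree
```

With a tolerance of zero the same function returns the most accurate model regardless of complexity, so how large a tolerance to accept is a decision about the business problem rather than about the code.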

Let’s revisit our example of using machine learning to predict train failures 36 hours in advance of their occurrence. Recall that accuracy, which we measured by calculating type one and type two errors on the hold-out data set, was an important success criterion for the project. But of equal importance in this particular case was gaining an improved understanding of the root causes of train failures so that component and system design could be improved and so that we could establish if failures were the result of specific operating conditions that could be avoided in future. Had accuracy alone been the most important criterion, we might have used an ANN or a deep learning approach. In fact, to ensure that we could address the second criterion, we actually started with a relatively simple decision tree algorithm. In short, we traded some model accuracy for results that were easy for our client to understand and to interpret.
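For readers who want to see what measuring type one and type two errors on a hold-out set can look like, here is a minimal Python sketch. It assumes that “failure within the next 36 hours” is treated as the positive class, so a type one error is a false alarm and a type two error is a missed failure. The function name error_rates and the labels are hypothetical; they are not the project’s data.

```python
def error_rates(actual, predicted):
    """Return (type_one_rate, type_two_rate) for boolean labels.

    Type one error: a failure was predicted but none occurred (false positive).
    Type two error: a failure occurred but was not predicted (false negative).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    false_pos = sum(1 for a, p in zip(actual, predicted) if p and not a)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    # Guard against division by zero when a class is absent from the hold-out set.
    type_one = false_pos / negatives if negatives else 0.0
    type_two = false_neg / positives if positives else 0.0
    return type_one, type_two


# Made-up hold-out labels: True means "failure within the next 36 hours".
actual = [True, True, False, False, False, True, False, False]
predicted = [True, False, False, True, False, True, False, False]
type_one, type_two = error_rates(actual, predicted)
print(f"type one rate: {type_one:.2f}, type two rate: {type_two:.2f}")
# type one rate: 0.20, type two rate: 0.33
```

Reporting the two rates separately matters because their costs usually differ: a false alarm may mean an unnecessary inspection, whereas a missed failure may mean a train out of service.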

In fact, Occam’s razor guided us not only in the choice of algorithm for this project, but also in how we trained it: we built several trees using different numbers of (transformed) sensor readings and settled on the model that met our accuracy threshold whilst using the fewest variables.
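The selection step just described might be sketched as follows. This is an assumption-laden illustration rather than the code we used: fewest_variables_model is a hypothetical helper, the evaluate callback stands in for “train a tree on n variables and score it on the hold-out set”, and the accuracies in the lookup table are invented.

```python
def fewest_variables_model(evaluate, variable_counts, threshold):
    """Try candidate model sizes from smallest to largest and return
    (n_variables, accuracy) for the first one that meets `threshold`.
    Returns None if no candidate is accurate enough.

    `evaluate` is a hypothetical callback that trains a tree on `n`
    variables and returns its hold-out accuracy.
    """
    for n in sorted(variable_counts):
        accuracy = evaluate(n)
        if accuracy >= threshold:
            return n, accuracy
    return None


# Stand-in for real training: made-up hold-out accuracy per variable count.
illustrative_accuracy = {2: 0.71, 4: 0.83, 8: 0.90, 16: 0.91, 32: 0.91}

result = fewest_variables_model(
    illustrative_accuracy.get, illustrative_accuracy.keys(), threshold=0.90
)
print(result)  # (8, 0.9)
```

Searching from the smallest model upwards means the loop stops as soon as the threshold is met, so larger trees that are only marginally more accurate are never chosen.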

Note that these kinds of trade-offs typically only make sense when the accuracy of the simpler model is at least in the same range as that of the more complex one. In a different scenario — one in which half a percentage point in accuracy might translate into millions of dollars of additional revenue or cost savings, say — you might happily opt for the model that is harder to interpret, or that requires more time to develop and deploy.

Key takeaways

We’ve said it before, and we’ll no doubt say it again on this blog: You should always start any machine learning project with a relentless focus on the business question that you want to answer and by formulating the key success criteria for the analysis. Assuming all other key criteria are (roughly) equal, then apply Occam’s razor and choose the model that is simplest to interpret, to explain, to deploy and to maintain.

In other words, prefer the simplest model that is sufficiently accurate, but ensure that you know the problem space well enough to know what “sufficiently accurate” means in practice. Because as Einstein, perhaps Occam’s greatest disciple, is often quoted as saying: “Everything should be made as simple as possible, but not simpler”.

Martin is a Senior Director in Teradata’s Go-To Market organisation, charged with articulating to prospective customers, analysts and media organisations Teradata’s strategy and the nature, value and differentiation of Teradata technology and solution offerings. Martin has 21 years of experience in the IT industry and is listed in dataIQ’s “Big Data 100” as one of the most influential people in UK data-driven business. He has worked for five organisations and was formerly the Data Warehouse Manager at Co-operative Retail in the UK and later the Senior Data Architect at Co‑operative Group.

Since joining Teradata, Martin has worked in Solution Architecture, Enterprise Architecture, Demand Generation, Technology Marketing and Management roles. Prior to taking up his current appointment, Martin led Teradata’s International Big Data CoE, a team of Data Scientists, Technology and Architecture Consultants tasked with assisting Teradata customers throughout Europe, the Middle East, Africa and Asia to realise value from their Big Data assets.

Martin is a former Teradata customer who understands the Analytics landscape and marketplace from the twin perspectives of an end-user organisation and a technology vendor. His Strata (UK) 2016 keynote can be found at: https://www.oreilly.com/ideas/the-internet-of-things-its-the-sensor-data-stupid and a selection of his Teradata Voice Forbes blogs can be found online, including a piece on the importance, and the limitations, of visualisation.

Martin holds a BSc (Hons) in Physics and Astronomy from the University of Sheffield and a Postgraduate Certificate in Computing for Commerce and Industry from the Open University. He is married with three children and is a lapsed supporter of Sheffield Wednesday Football Club. In his spare time, Martin enjoys playing with technology, flying gliders, photography and listening to guitar music.

Dr. Frank Säuberlich leads the Data Science & Data Innovation unit of Teradata Germany. Part of his responsibility is to make the latest market and technology developments available to Teradata customers. Currently, his main focus is on topics such as predictive analytics, machine learning and artificial intelligence.

Following his studies of business mathematics, Frank Säuberlich worked as a research assistant at the Institute for Decision Theory and Corporate Research at the University of Karlsruhe (TH), where he was already dealing with data mining questions.

His professional career included the positions of a senior technical consultant at SAS Germany and of a regional manager customer analytics at Urban Science International.

Frank Säuberlich has been with Teradata since 2012. He began as an expert in advanced analytics and data science in the International Data Science team. Later on, he became Director Data Science (International).
