Predictive Analytics

What it is and why it matters

Predictive analytics is the use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data. The goal is to go beyond knowing what has happened to providing a best assessment of what will happen in the future.

Predictive Analytics History & Current Advances

Though predictive analytics has been around for decades, it's a technology whose time has come. More and more organizations are turning to predictive analytics to increase their bottom line and competitive advantage. Why now?

Growing volumes and types of data, and more interest in using data to produce valuable insights.

Faster, cheaper computers.

Easier-to-use software.

Tougher economic conditions and a need for competitive differentiation.

With interactive and easy-to-use software becoming more prevalent, predictive analytics is no longer just the domain of mathematicians and statisticians. Business analysts and line-of-business experts are using these technologies as well.

Why is predictive analytics important?

Organizations are turning to predictive analytics to help solve difficult problems and uncover new opportunities. Common uses include:

Optimizing marketing campaigns. Predictive analytics is used to predict customer responses or purchases and to promote cross-sell opportunities. Predictive models help businesses attract, retain and grow their most profitable customers.

Improving operations. Many companies use predictive models to forecast inventory and manage resources. Airlines use predictive analytics to set ticket prices. Hotels try to predict the number of guests for any given night to maximize occupancy and increase revenue. Predictive analytics enables organizations to function more efficiently.

Reducing risk. Credit scores, which are used to assess a buyer’s likelihood of default on purchases, are a well-known example of predictive analytics. A credit score is a number generated by a predictive model that incorporates data relevant to a person’s creditworthiness. Other risk-related uses include insurance claims and collections.

Who's using it?

The financial industry, with huge amounts of data and money at stake, has long embraced predictive analytics to detect and reduce fraud, measure credit risk, maximize cross-sell/up-sell opportunities and retain valuable customers. Commonwealth Bank uses analytics to predict the likelihood of fraudulent activity for any given transaction before it is authorized, within 40 milliseconds of the transaction being initiated.

Since the now-infamous study that showed men who buy diapers often buy beer at the same time, retailers everywhere have been using predictive analytics to determine which products to stock, how effective promotional events are and which offers are most appropriate for consumers. Staples analyzes consumer behavior to build a complete picture of its customers, and has realized a 137 percent ROI.

Whether it is predicting equipment failures and future resource needs, mitigating safety and reliability risks, or improving overall performance, the energy industry has embraced predictive analytics with vigor. Salt River Project is the second-largest public power utility in the US and one of Arizona's largest water suppliers. There, analyses of machine sensor data predict when power-generating turbines need maintenance.

Governments have been key players in the advancement of computer technologies. The US Census Bureau has been analyzing data to understand population trends for decades. Governments now use predictive analytics like many other industries – to improve service and performance; detect and prevent fraud; and better understand consumer behavior. They also use predictive analytics to enhance cybersecurity.

In addition to detecting claims fraud, the health insurance industry is taking steps to identify patients most at risk of chronic disease and find what interventions are best. Express Scripts, a large pharmacy benefits company, uses analytics to identify those not adhering to prescribed treatments, resulting in a savings of $1,500 to $9,000 per patient.

Manufacturers need to identify the factors that lead to reduced quality and production failures, and to optimize parts, service resources and distribution. Lenovo is just one manufacturer that has used predictive analytics to better understand warranty claims, an initiative that led to a 10 to 15 percent reduction in warranty costs.

Putting the Magic in the Magic

Sports analytics is a hot area, thanks in part to Nate Silver and tournament predictions. The NBA’s Orlando Magic uses SAS predictive analytics to improve revenue and determine starting lineups. Business users across the Orlando Magic organization have instant access to information. The Magic can now visually explore the freshest data, right down to the game and seat.

How It Works

Predictive models use known results to develop (or train) a model that can be used to predict values for different or new data. The model produces predictions of the target variable, such as the probability of an event or an expected value (for example, revenue), based on the estimated significance of a set of input variables.

This is different from descriptive models that help you understand what happened, or diagnostic models that help you understand key relationships and determine why something happened. Entire books are devoted to analytical methods and techniques. Complete college curriculums delve deeply into this subject. But for starters, here are a few basics.

There are two types of predictive models. Classification models predict class membership. For instance, you try to classify whether someone is likely to leave, whether they will respond to a solicitation, whether they are a good or bad credit risk, etc. Usually, the model results are in the form of 0 or 1, with 1 being the event you are targeting. Regression models predict a number: for example, how much revenue a customer will generate over the next year or the number of months before a component will fail on a machine.

Three of the most widely used predictive modeling techniques are decision trees, regression and neural networks.

Regression (linear and logistic) is one of the most popular methods in statistics. Regression analysis estimates relationships among variables. Linear regression is intended for a continuous response whose errors can be assumed to follow a normal distribution; it finds key patterns in large data sets and is often used to determine how much specific factors, such as the price, influence the movement of an asset. With linear regression, we want to predict a number, called the response or Y variable. Simple linear regression uses one independent variable to explain and/or predict the outcome of Y. Multiple regression uses two or more independent variables to predict the outcome. With logistic regression, unknown values of a discrete variable are predicted based on known values of other variables. The response variable is categorical, meaning it can assume only a limited number of values. With binary logistic regression, the response variable has only two values, such as 0 or 1. In multinomial logistic regression, the response variable can have several levels, such as low, medium and high, or 1, 2 and 3.
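
To make the two flavors concrete, here is a minimal sketch in plain Python. It assumes a single input variable, fits the linear model by ordinary least squares and fits the binary logistic model by simple gradient descent. The function names, the learning rate and the toy data (ad spend, revenue, months inactive, churn flags) are all hypothetical, chosen only for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_proba(b0, b1, x):
    """Logistic model: probability that the response is 1 for input x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(xs, labels, lr=0.1, epochs=2000):
    """Binary logistic regression with one input, fit by gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, labels):
            error = predict_proba(b0, b1, x) - y
            g0 += error
            g1 += error * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Hypothetical toy data: predict a number (linear regression).
ad_spend = [1, 2, 3, 4, 5]
revenue = [3, 5, 7, 9, 11]
intercept, slope = fit_line(ad_spend, revenue)

# Hypothetical toy data: predict a 0/1 outcome (logistic regression).
months_inactive = [0, 1, 2, 3, 6, 7, 8, 9]
churned = [0, 0, 0, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(months_inactive, churned)
```

Here `fit_line` recovers a slope of 2 and an intercept of 1 for the toy revenue data, while `predict_proba(b0, b1, x)` returns a probability that you would compare with a cutoff such as 0.5 to get a 0 or 1.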

Decision trees are classification models that partition data into subsets based on categories of input variables. This helps you understand someone's path of decisions. A decision tree looks like a tree with each branch representing a choice between a number of alternatives, and each leaf representing a classification or decision. This model looks at the data and tries to find the one variable that splits the data into logical groups that are the most different. Decision trees are popular because they are easy to understand and interpret. They also handle missing values well and are useful for preliminary variable selection. So, if you have a lot of missing values or want a quick and easily interpretable answer, you can start with a tree.
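
The search for "the one variable that splits the data" can be sketched as follows. This is a simplified illustration, not a full tree: it looks for a single best split and scores candidates with Gini impurity, which is one common choice among several. The customer data and the names `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Try every feature and threshold; keep the split with the lowest weighted impurity."""
    best = None  # (impurity, feature_index, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical customers: (age, support_calls) and whether they stayed or left.
rows = [(25, 0), (47, 1), (33, 5), (52, 6), (41, 4), (29, 1)]
labels = ["stay", "stay", "leave", "leave", "leave", "stay"]
impurity, feature, threshold = best_split(rows, labels)
```

On this toy data the best split is on feature 1 (support calls) at a threshold of 1, which separates the two groups perfectly. A full tree would repeat the same search inside each resulting group.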

Neural networks are sophisticated techniques capable of modeling extremely complex relationships. They’re popular because they’re powerful and flexible. The power comes in their ability to handle nonlinear relationships in data, which is increasingly common as we collect more data. They are often used to confirm findings from simple techniques like regression and decision trees. Neural networks are based on pattern recognition and some artificially intelligent processes that graphically “model” parameters. They work well when no mathematical formula is known that relates inputs to outputs, prediction is more important than explanation or there is a lot of training data. Artificial neural networks were originally developed by researchers who were trying to mimic the neurophysiology of the human brain.
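
As a small illustration of what "nonlinear" means here, the sketch below wires up a tiny network with two hidden units. The weights are hand-picked for this example rather than learned from data (a real network would learn them during training), and the function names are hypothetical. The network reproduces the exclusive-or pattern, which no single straight-line model can fit.

```python
def relu(v):
    """Rectified linear unit: passes positive values, clips negatives to zero."""
    return max(0.0, v)

def tiny_network(x1, x2):
    """Two inputs -> two ReLU hidden units -> one linear output."""
    # Hand-picked (not trained) weights and biases, for illustration only.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    return 1.0 * h1 - 2.0 * h2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [tiny_network(a, b) for a, b in inputs]
```

The output is 1 when exactly one input is 1 and 0 otherwise. The hidden layer is what makes this possible: each hidden unit bends the input space, and the output combines the bends.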

Other Popular Techniques You May Hear About

Bayesian analysis. Bayesian methods treat parameters as random variables and define probability as "degrees of belief" (that is, the probability of an event is the degree to which you believe the event is true). When performing a Bayesian analysis, you begin with a prior belief regarding the probability distribution of an unknown parameter. After learning information from data you have, you change or update your belief about the unknown parameter.
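
The prior-to-posterior update can be sketched with the simplest textbook case: a Beta prior on an unknown success rate, updated with observed successes and failures. The function names and the counts below are hypothetical, and a real analysis would often use a richer model.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Beta prior + binomial data -> Beta posterior (conjugate update)."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Prior belief: Beta(1, 1), meaning every response rate is equally plausible.
prior = (1, 1)
# Hypothetical data: 7 customers responded to an offer, 3 did not.
posterior = update_beta(*prior, successes=7, failures=3)

prior_estimate = beta_mean(*prior)          # belief before seeing data
posterior_estimate = beta_mean(*posterior)  # belief after seeing data
```

The estimate of the response rate moves from 0.5 under the prior to 8/12, about 0.67, after the data is taken into account.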

Ensemble models. Ensemble models are produced by training several similar models and combining their results to improve accuracy, reduce bias, reduce variance and identify the best model to use with new data.
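
One simple way to combine results is a majority vote, sketched below. The three "models" are hypothetical threshold rules standing in for trained models, and voting is only one of several ways to combine them (averaging and stacking are others).

```python
from collections import Counter

def majority_vote(models, x):
    """Ask every model for a prediction and return the most common answer."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for three trained churn models (1 = will leave).
models = [
    lambda c: 1 if c["support_calls"] > 3 else 0,
    lambda c: 1 if c["months_inactive"] > 2 else 0,
    lambda c: 1 if c["late_payments"] > 1 else 0,
]

customer = {"support_calls": 5, "months_inactive": 1, "late_payments": 2}
prediction = majority_vote(models, customer)
```

Two of the three rules flag this customer, so the ensemble predicts 1 even though one model disagrees.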

Gradient boosting. This is a boosting approach that fits a sequence of simple models, typically small decision trees, each trained (often on a resample of the data set) to correct the errors of the models before it, and combines them into a weighted sum. Like decision trees, boosting makes no assumptions about the distribution of the data. Boosting is less prone to overfitting the data than a single decision tree, and if a decision tree fits the data fairly well, then boosting often improves the fit. (Overfitting means the model is too complex, for example because it uses too many variables, so it fits noise in the training data. Underfitting means the opposite: the model is too simple to capture the pattern. Both reduce prediction accuracy.)
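
Here is a minimal sketch of the core loop behind gradient boosting for a numeric target. It assumes squared error and one-split trees ("stumps"), and it leaves out resampling and the other refinements that production implementations add. The data, the learning rate and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still unexplained."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data with three rough "steps".
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
one_round = boost(xs, ys, rounds=1)
twenty_rounds = boost(xs, ys, rounds=20)
```

Comparing `mse(one_round, xs, ys)` with `mse(twenty_rounds, xs, ys)` shows the training error shrinking as rounds are added. That is also why the number of rounds and the learning rate need to be checked against held-out data, since training error alone keeps improving.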

Incremental response (also called net lift or uplift models). These model the change in probability caused by an action. They are widely used to reduce churn and to discover the effects of different marketing programs.
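
In its simplest form, the "change in probability caused by an action" is the difference between the response rate of a treated group and that of a comparable control group. The sketch below shows only that arithmetic; real uplift models estimate the difference per customer, and the outcome lists here are hypothetical.

```python
def response_rate(outcomes):
    """Share of customers who responded (outcomes are 1 = responded, 0 = did not)."""
    return sum(outcomes) / len(outcomes)

def uplift(treated, control):
    """Estimated change in response probability caused by the action."""
    return response_rate(treated) - response_rate(control)

# Hypothetical segment: customers who received an offer vs. those who did not.
treated = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 7 of 10 responded
control = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]   # 4 of 10 responded
segment_uplift = uplift(treated, control)
```

Here the offer appears to add about 30 percentage points of response in this segment. A segment with uplift near zero would respond (or not) regardless of the offer.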

K-nearest neighbor (k-NN). This is a nonparametric method for classification and regression that predicts an object’s value or class membership based on the k closest training examples.
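
A minimal sketch of both uses, assuming Euclidean distance and unweighted neighbors (other distance measures and weighting schemes are common). The training points and function names are hypothetical.

```python
import math
from collections import Counter

def nearest_neighbors(train, query, k):
    """Return the k training examples closest to the query point."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: the most common label among the k nearest examples."""
    labels = [label for _, label in nearest_neighbors(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: the average value of the k nearest examples."""
    values = [value for _, value in nearest_neighbors(train, query, k)]
    return sum(values) / k

# Hypothetical labeled points: (features, label).
spend_class = [((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
               ((6, 6), "high"), ((7, 6), "high"), ((6, 7), "high")]
label = knn_classify(spend_class, (2, 2))

# Hypothetical numeric targets: (features, value).
spend_value = [((1,), 10.0), ((2,), 12.0), ((3,), 14.0), ((10,), 40.0)]
value = knn_regress(spend_value, (2,))
```

The query point (2, 2) sits among the "low" examples, so it is labeled "low"; the regression query averages its three nearest values and ignores the distant outlier.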

Partial least squares. This flexible statistical technique can be applied to data of any shape. It models relationships between inputs and outputs even when the inputs are correlated and noisy, there are multiple outputs or there are more inputs than observations. The method of partial least squares looks for factors that explain both response and predictor variations.

Principal component analysis. The purpose of principal component analysis is to derive a small number of uncorrelated linear combinations (principal components) of a set of variables that retain as much of the information in the original variables as possible.
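
The sketch below finds only the first principal component, using power iteration on the covariance matrix. That is one simple way to do it, chosen here because it needs no libraries; it assumes the starting vector is not perpendicular to the component being sought. The data points and the function name are hypothetical.

```python
import math

def first_principal_component(points, iterations=100):
    """Return (unit direction, share of total variance explained) for the first component."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    variance_along_v = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dims))
                           for i in range(dims))
    total_variance = sum(cov[i][i] for i in range(dims))
    return v, variance_along_v / total_variance

# Hypothetical data: two measurements that rise and fall together.
points = [(2.0, 1.9), (1.0, 1.1), (3.0, 3.2), (4.0, 3.8)]
component, explained = first_principal_component(points)
```

Because the two variables move almost in lockstep, a single combination of them captures nearly all of the variance, which is exactly the situation in which replacing many variables with a few components loses little information.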

Support vector machine. This supervised machine learning technique finds the boundary that separates classes with the widest possible margin, using the training examples closest to that boundary (the support vectors). It can be used for both classification and regression.

Time series data mining. Time series data is time-stamped and collected over time at a particular interval (sales in a month, calls per day, web visits per hour, etc.). Time series data mining combines traditional data mining and forecasting techniques. Data mining techniques such as sampling, clustering and decision trees are applied to data collected over time with the goal of improving predictions.
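
Two small building blocks illustrate the idea: turning a series into a table of lagged values that ordinary data mining techniques can work on, and a naive moving-average forecast as a baseline. Both are simplified sketches; the monthly sales figures and function names are hypothetical.

```python
def make_lag_table(series, n_lags):
    """Turn a series into rows of (previous n_lags values, next value)."""
    return [(series[i - n_lags:i], series[i]) for i in range(n_lags, len(series))]

def moving_average_forecast(series, window):
    """Forecast the next value as the average of the last `window` values."""
    return sum(series[-window:]) / window

# Hypothetical monthly sales, oldest first.
monthly_sales = [120, 130, 125, 140, 150, 145]

lag_rows = make_lag_table(monthly_sales, n_lags=3)
forecast = moving_average_forecast(monthly_sales, window=3)
```

Each row in `lag_rows` pairs three consecutive months with the month that followed, so a decision tree or regression model can be trained on it like any other table. The moving average gives a simple benchmark that such a model should beat.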

What do you need to get started using predictive analytics?

The first thing you need to get started using predictive analytics is a problem to solve. What do you want to know about the future based on the past? What do you want to understand and predict? You’ll also want to consider what will be done with the predictions. What decisions will be driven by the insights? What actions will be taken?

Second, you’ll need data. In today’s world, that means data from a lot of places: transactional systems, data collected by sensors, third-party information, call center notes, web logs, etc. You’ll need a data wrangler, or someone with data management experience, to help you cleanse and prep the data for analysis. Preparing the data for a predictive modeling exercise also requires someone who understands both the data and the business problem. How you define your target is essential to how you can interpret the outcome. (Data preparation is considered one of the most time-consuming aspects of the analysis process, so be prepared for that.)

After that, the predictive model building begins. Increasingly easy-to-use software means more people can build analytical models. But you’ll still likely need some sort of data analyst who can help you refine your models and come up with the best performer. And then you might need someone in IT who can help deploy your models. That means putting the models to work on your chosen data – and that’s where you get your results.

Predictive modeling requires a team approach. You need people who understand the business problem to be solved, someone who knows how to prepare data for analysis, someone who can build and refine the models, and someone in IT to ensure that you have the right analytics infrastructure for model building and deployment. An executive sponsor can help make your analytic hopes a reality.