Bill McDermott’s SAPPHIRE keynote focused on the Experience Economy – how businesses must enrich their financial and operational processes with customer feedback. To quote McDermott, “Experience is now the organizing principle of the global economy.” This blog will examine how experiential data (X-Data) can be combined with operational data (O-Data), and how SAP Analytics Cloud supports the concurrent analysis of multiple data sources (“Data Blending”) to create more powerful analytic applications for the Experience Economy.

To show the power of blending X-Data and O-Data, we will review data from the 10 highest-grossing movies of the summer. Our operational data set includes box office performance, ticket sales, and production cost data for each of the films. (“Information courtesy of Box Office Mojo. Used with permission.” – see more great movie analysis at boxofficemojo.com.) For our experiential data set, social media postings for each movie are analyzed based on data collected from Twitter (i.e., keywords, hashtags, and Twitter handles). Utilizing a Machine Learning technique called Natural Language Processing (NLP), we’re able to understand the emotion present in each tweet (this is also called sentiment analysis), and by combining the NLP analysis with location data, we can begin to understand how each market is reacting to the various films.

To account for the informal language used on Twitter, the NLP library used in this analysis weighs capitalization, punctuation, and emojis. Including these features keeps the sentiment analysis as accurate as possible.
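To make that concrete, here is a toy, rule-based scorer in the spirit of libraries like VADER. This is an illustrative sketch, not the library used in our analysis; the word lists, emoji scores, and weights are invented:

```python
# Illustrative sketch of how a sentiment scorer might weigh informal-language
# cues (capitalization, punctuation, emojis). Word lists and weights are
# invented; real libraries such as VADER use richer lexicons and heuristics.

POSITIVE = {"great", "amazing", "loved", "fun"}
NEGATIVE = {"boring", "awful", "hated", "bad"}
EMOJI_SCORES = {"😍": 0.5, "🔥": 0.4, "😡": -0.5, "💤": -0.3}

def sentiment(tweet: str) -> float:
    """Return a rough sentiment score in [-1, 1] for a tweet."""
    score = 0.0
    for token in tweet.split():
        word = token.strip("!?.,").lower()
        if word in POSITIVE:
            weight = 0.3
        elif word in NEGATIVE:
            weight = -0.3
        else:
            weight = EMOJI_SCORES.get(token, 0.0)
        # All-caps words signal emphasis, boosting intensity.
        if token.isupper() and len(token) > 1:
            weight *= 1.5
        # Exclamation marks amplify whatever sentiment is present.
        weight *= 1.0 + 0.1 * token.count("!")
        score += weight
    return max(-1.0, min(1.0, score))

print(sentiment("LOVED it!! 😍"))  # strongly positive
print(sentiment("so boring 💤"))   # negative
```

Note how the same word scores higher when capitalized – a small nod to how much signal informal punctuation carries on Twitter.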

Data Blending

SAP Analytics Cloud (SAC) allows any number of local or remote data sources to be present in a given story, letting analysts avoid the data silos that are often present in legacy reporting tools. SAC can join separate models together based on their corresponding dimensions in a process referred to as Data Blending. This process makes it possible to draw conclusions from one model in relation to another and to identify patterns that can be used to establish a correlation between two sets of data. By linking the sentiment analysis model to the box office performance model, it is possible to analyze how the financial performance of a movie is influenced by the customer experience.

SAC makes it incredibly simple to link models on the dimensions they share. For instance, we were able to link our Twitter and box office data models by their movie title and date dimensions.
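Under the hood, this kind of blend behaves like a join on the linking dimensions. The following Python sketch shows the idea with invented, placeholder figures – SAC handles this for you, so the code is purely conceptual:

```python
# Conceptual sketch of data blending: joining an experiential (X-Data) model
# and an operational (O-Data) model on shared dimensions (title, date).
# All figures below are placeholders, not the actual data set.

x_data = [  # Twitter sentiment model
    {"title": "Booksmart", "date": "2019-06-01", "avg_sentiment": 0.62, "tweets": 1200},
    {"title": "Brightburn", "date": "2019-06-01", "avg_sentiment": -0.10, "tweets": 800},
]

o_data = [  # box office model
    {"title": "Booksmart", "date": "2019-06-01", "tickets": 45000},
    {"title": "Brightburn", "date": "2019-06-01", "tickets": 30000},
]

# Index the operational model by its linking dimensions, then blend.
o_index = {(row["title"], row["date"]): row for row in o_data}
blended = [
    {**x, "tickets": o_index[(x["title"], x["date"])]["tickets"]}
    for x in x_data
    if (x["title"], x["date"]) in o_index
]

for row in blended:
    print(row["title"], row["avg_sentiment"], row["tickets"])
```

Once the rows share a record, sentiment and ticket sales can appear side by side in the same visualization.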

By linking these two models and blending their data, visualizations can be created in SAC that display information from both simultaneously. This allows us to discover connections on a larger scale and to focus on specific elements within each model. In the visualization below, the financial metrics of approximate ticket sales and studio revenue are contextualized and presented alongside a graphic showing the volume of posts made on Twitter and their overall sentiment. This uniform presentation makes it possible to observe our X-Data and O-Data for individual movies and examine the relationship between their financial and social media metrics. For instance, the first image below shows data for the movie Booksmart, while the second shows the same data for the movie Brightburn.

Visualization displaying box office and social media data for the movie Booksmart.

Visualization displaying box office and social media data for the movie Brightburn.

Reporting

Combining these two streams of data with the modeling capabilities of SAC allows for the creation of robust visualizations, which can be used to more effectively examine relationships within the data and provide context for key business problems. These visualizations provide an avenue for furthering the insights generated by the models created in SAC and offer a means of integrating predictive analytics technology with already existing data.

SAC charts can show financial measures, such as unit sales, measured against Twitter sentiment. This allows for the analysis of patterns that may predict the performance of a product. For instance, in the case of the movie Pokémon Detective Pikachu, a high volume of predominantly positive tweets was immediately followed by a spike in ticket sales.
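A simple way to quantify that kind of pattern is a lagged correlation between one day’s positive tweet volume and the next day’s ticket sales. The daily figures below are invented for illustration:

```python
# Sketch of the pattern analysis described above: does positive tweet volume
# on day t correlate with ticket sales on day t+1? Figures are placeholders.

positive_tweets = [200, 350, 900, 1500, 1400]            # day t
ticket_sales    = [10000, 12000, 15000, 40000, 55000]    # day t+1

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(positive_tweets, ticket_sales)
print(f"lagged correlation: {r:.2f}")  # close to 1 suggests buzz precedes sales
```

A coefficient near 1 would support the Detective Pikachu observation: the tweet spike leads, the ticket spike follows.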

By mapping the number of tweets associated with specific films across the month of June, we can see how overall Twitter activity related to each movie varies from day to day. As demonstrated, this visual can display the data for the entire month of June but can be effortlessly adjusted to a daily view in order to offer more detailed insights. The visual language capabilities of SAC allow this relationship to be expressed in a distribution chart that not only utilizes engaging aesthetic elements but takes advantage of those elements to communicate information directly and effectively.

Modeling in SAC also allows for geo-enrichment, after which data can be presented in dynamic map visuals such as the one above. This map contains two layers of information related to our X-Data and O-Data. The size of each bubble indicates how many tweets related to our set of movies were posted in each country, while the shading of each country indicates the average positivity of these tweets in that region. The visualization separates the data in a way that makes it easy to measure activity by region and displays it in a coherent and interesting manner. In the same way that other charts can be filtered to highlight relevant data, these maps can be altered to focus on specific geographical areas or movies.
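The two map layers boil down to a per-country aggregation of the tweet-level data: a count for the bubble size and a mean sentiment for the shading. A conceptual sketch with placeholder tweets:

```python
# Sketch of the two map layers described above: per-country tweet counts
# (bubble size) and average sentiment (shading). Tweets are placeholders.
from collections import defaultdict

tweets = [
    {"country": "US", "sentiment": 0.8},
    {"country": "US", "sentiment": 0.4},
    {"country": "JP", "sentiment": 0.9},
    {"country": "UK", "sentiment": -0.2},
]

counts = defaultdict(int)
totals = defaultdict(float)
for t in tweets:
    counts[t["country"]] += 1
    totals[t["country"]] += t["sentiment"]

map_layers = {
    country: {"tweet_count": counts[country],
              "avg_sentiment": totals[country] / counts[country]}
    for country in counts
}
print(map_layers)
```

Filtering to a single movie, as in the Detective Pikachu map, is just a matter of filtering the tweet list before aggregating.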

Map filtered to show global social media activity related to only the Pokémon Detective Pikachu movie.

Conclusion

In this blog, we have examined some of the powerful ways in which experiential data can be used to better understand operational performance. Using SAP Analytics Cloud, we can visualize these insights, easily follow trends over time, and compare them to different metrics, driving “Experiential KPIs”. Collaborative Enterprise Planning with SAC replaces data and process silos with seamless collaboration to present and analyze data from different angles. The visualizations presented allow for greater discovery of patterns and causation that may have previously been hidden in large spreadsheets, divided among separate departments, and buried in vast amounts of data. With discovery facilitated, drawing insight becomes less of a challenge and easier to transform into planning and solutions. The experiential economy is won and lost on data insights, and the analyses shown today are just some of the ways to achieve those insights.

This blog was created by TruQua’s Summer 2019 Interns, Nicole Aragon and Treeank Patnaik. Stay tuned for our upcoming “Movie Watch” blogs, where we will continue to analyze social media trends for this summer’s hottest films.

About our authors:

Nicole Aragon

Consulting Intern

Nicole Aragon is a Senior at the University of Texas at Austin, where she studies Management Information Systems in the McCombs School of Business. She plans on continuing her experience working in data science after college.

Treeank Patnaik

Consulting Intern

Treeank Patnaik is a Consulting Intern at TruQua, working on improving business intelligence by using advanced analytics and machine learning software. He is currently pursuing a degree in Mechanical Engineering at the University of Texas at Austin, along with certificates in Applied Statistical Modeling and Foundations of Business Administration.

JS Irick

Director of Data Science and Artificial Intelligence

JS Irick has the best job in the world; working with a talented team to solve the toughest business challenges. JS is an internationally recognized speaker on the topics of Machine Learning, SAP Planning, SAP S/4HANA and Software development. As the Director of Data Science and Artificial Intelligence at TruQua, JS has built best practices for SAP implementations in the areas of SAP HANA, SAP S/4HANA reporting, and SAP S/4HANA customization.

Editor’s Note

TruQua is currently hosting a closed beta for the Social Media analytics tools described in this article. If you are interested in joining the beta program, please contact js.irick@truqua.com.

Welcome to TruQua’s New Perspectives blog series. This blog series provides a platform for our consultants to share their perspectives, insights, and experiences on emerging technologies such as SAP HANA and SAP Analytics Cloud.

Today we are speaking with Senior Cloud Architect Daniel Settanni. Daniel’s passion is bringing AI to the realm of corporate finance. With his deep expertise in time series analysis, cloud architecture and SAP planning solutions, Daniel is able to not only create impressive Machine Learning models but also deploy them in an integrated, accurate and explainable way. As you will see below, Daniel does an exceptional job at both demystifying the technologies behind AI/ML and “defeating the black box” through explainable AI.

Perspective 1: Getting Started with Machine Learning for Finance

JS Irick: Daniel, a few weeks back you and I hosted a pre-conference session at SAP-Centric where we led attendees through building their first Machine Learning model to predict OPEX (based on published financials for a Fortune 500 company). In that case, it was easy for our attendees to be successful because the problem was already defined. When it comes to the enterprise level, identifying the opportunity and success criteria is the first major roadblock customers face. Can you talk a little bit about how customers can set themselves up for success in their foundational ML projects?

Daniel Settanni: Sure JS. Figuring out where to start with ML can seem like an insurmountable challenge, and for good reason. ML and AI pop up in the news every day. Topics like ML-generated movie recommendations and AI-created art are cool, but not relatable to most enterprises. On the flip side, Enterprise Software usually incorporates ML/AI in automation-focused tasks that are common across wide swaths of the market. These are much more relatable, but don’t speak to how Enterprises can use ML to solve their own problems… from the mundane to the highly strategic.

So, now to your question.

If an Enterprise is just starting out with Machine Learning and is having a hard time finding the right opportunity, I’d suggest picking something relatively simple, like a basic forecasting project. These are a great place to start because the models can be kept simple, the predictions can be easily integrated with the existing business processes, and the results deliver real value (I haven’t met an Enterprise yet that isn’t looking to improve their forecasting accuracy in one area or another). But above all, projects like this provide real-world experience with the ins and outs of ML – plus they always generate a ton of new ideas on where to use ML next.

If the Enterprise has already identified the opportunity, then I’d make sure that their success criteria include delivering a way for users to interact with the model. This could be as simple as pulling predictions into an existing business system as a new Forecast version, or entail developing a custom dashboard for what-if analysis. In any case, if the success criterion is simply to build an accurate model that never sees the world beyond the data science team, they will be losing out on the vast majority of ML’s true Enterprise value.

JS Irick: “But above all, projects like this provide real-world experience with the ins and outs of ML – plus they always generate a ton of new ideas on where to use ML next.” That’s definitely my favorite part of working on these projects with you. We get to see the lightbulb go on in real time, which leads to fascinating strategic discussions. I believe the best consultants not only help their clients with their current project, but also help chart the way forward through education and enablement.

Can you talk a bit more about “interacting with the model”, with some examples? I think this is important for folks just getting started with AI/ML.

Daniel Settanni: Absolutely. The main point here is that the more completely people can interact with something (an ML model in this case), the more they will understand it and the greater the understanding, the greater the potential for meaningful insights.

This “interaction” can look very different depending on the business problem being solved.

For example, if we built out a Forecasting model, the minimum level of interaction would be a spreadsheet of predictions. This isn’t a great option for a lot of reasons, the most basic of which is that it’s not in the same place as the related business data.

We can fix that by integrating our hypothetical Forecasting model with the related Enterprise Application. Now the forecasts can be viewed in context, but there isn’t any transparency around how the model came to the conclusions it did. The best case here is that the forecast is proved to be accurate at which point the business will just accept it – but being overly reliant on a model in this way is dangerous.

So, next, we’ll add explainability to our model. Based on this, our analysts gain insight into not only what the model predicted, but why it arrived at that answer as well. Since our analysts are the true experts, this can lead to valuable feedback on ways to make the model better. Because there’s transparency, the model can also become more of a trusted tool.

We’ve made a lot of progress from our spreadsheet, but we don’t have to stop there. We could make the model truly interactive by allowing analysts to tweak its assumptions and see the impact on the predictions and explanations. At this point, you have what I like to call a Strategic model, one that can aid in making business decisions.

Before moving on, I’d like to highlight another example to show how this methodology can be applied to other areas. As you know, we built out an Employee Retention model last year that was integrated with SuccessFactors. The basic output of the model was the likelihood an employee would leave during the next year. The predictions were based on factors like salary growth, time at the company, historical promotions, etc.

To make this model most valuable, we didn’t stop at the raw prediction. We created a dashboard where HR analysts could actually predict the impact of interventions, greatly increasing the chance they could retain their top talent.

These are just a few examples of why I believe interaction is one of the core pillars of a successful, and valuable, ML-based solution.

Perspective 2: From Prediction to Explanation to Intervention

JS Irick: From my work in medical science, I’ve always felt that researchers first seek to understand so that you can intervene. While the moment of discovery is exciting, it pales in comparison to using that discovery to impact change. Machine Learning changes the paradigm a bit, in that first you predict, then you explain, then you intervene. This leads to two questions – first, how do you get predictions into analysts’ hands quickly enough so that there is time to intervene? Second, can you explain to our readers how you go from a raw numerical prediction to actionable insights?

Daniel Settanni: You bring up some great points here. Without insight, there can be no action and without action, there can be no value.

The first question is easy to answer, but not always simple to implement. To get predictions into Analysts’ hands quickly enough to intervene, a model must either be integrated with their business system or directly accessible in some other way. If Analysts have to wait for quarterly or even monthly reports, then they’ve probably missed their chance to act. On the other hand, allowing them to perform some degree of what-if analysis in real time can put them dramatically ahead of the curve.

One quick anecdote before I move on to question number two… in my experience, the initial success criterion for an ML project is accuracy. This makes complete sense, but once you deliver an accurate model the next logical question is “Why?” No one likes a black box, and even more importantly, no one trusts a black box. Without some degree of understanding, trusting the results of an ML model can feel like a leap of faith – and who wants to bet their career on that?

So how do you get from raw numerical predictions to actionable insights? It starts with deeply understanding the problem you are trying to solve and building your model around that (instead of just accuracy). This involves carefully selecting features (model inputs) that are related to the question at hand. Having relatable features can give analysts some confidence in a model, but adding in Explainable AI, a technology that figures out each feature’s contribution to a prediction, can really deliver the trust needed to go from prediction to action.
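For a linear model, feature contributions can be computed directly: each feature’s effect is its weight times its deviation from a baseline, so every prediction decomposes into per-feature explanations. The sketch below uses invented feature names and weights, loosely echoing the employee retention example mentioned earlier; it is an illustration of the idea, not Florence’s actual method:

```python
# Minimal sketch of explainability for a linear model: each feature's
# contribution is weight * (value - baseline). Names and weights are invented.

weights  = {"salary_growth": -2.0, "tenure_years": -0.05, "overtime_hours": 0.01}
baseline = {"salary_growth": 0.03, "tenure_years": 4.0, "overtime_hours": 5.0}
intercept = 0.20  # baseline attrition risk

def predict_with_explanation(employee):
    """Return (prediction, per-feature contributions)."""
    contributions = {
        f: w * (employee[f] - baseline[f]) for f, w in weights.items()
    }
    prediction = intercept + sum(contributions.values())
    return prediction, contributions

risk, why = predict_with_explanation(
    {"salary_growth": 0.0, "tenure_years": 1.0, "overtime_hours": 20.0}
)
# The analyst sees not just the risk score but which features drove it.
print(f"risk: {risk:.2f}")
for feature, effect in sorted(why.items(), key=lambda kv: -abs(kv[1])):
    print(f"  {feature}: {effect:+.2f}")
```

Because the contributions sum exactly to the prediction, the analyst can see at a glance which inputs pushed the number up or down.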

JS Irick: Without getting too deep into the technical side, can you talk a bit more about feature selection? In a lot of ways, I lean on my research experience, which means I focus on explaining statistical malpractice: “this will pollute your results because…”. I’d love to hear a positive, actionable take on the value of feature selection.

Daniel Settanni: You’ve picked a topic close to my heart. Before diving in, here’s a quick recap on what a feature is. In its most basic form, a Machine Learning model uses a bunch of inputs to predict an output. In Data Science lingo the inputs are called features and the output is called the label.

There’s one more topic we need to touch on before we can really talk about features… the different reasons models are created in the first place. This may sound like a silly statement – we create models to predict things, of course!

But… there are different types of predictions.

For example, I may want to create a model that can remove human bias from my Sales forecasts, or I may want a model that can accurately predict the impact of a certain business decision to my Sales forecast. In both cases, we’re forecasting the same thing, but our goals are very different.

How can we do this? The answer lies in feature selection.

In the first scenario (where we want to remove human bias), we would focus on factors outside the business’s control. These would likely include macroeconomic data, sales trends, consumer (financial) health, and the like. By training a model with these types of features, we would be capturing the historic performance of the company. These types of models tend to be very accurate, especially for mature organizations.

In the second scenario, we want to do pretty much the opposite. Instead of focusing on things we can’t change to capture historical performance, we look at the things we can – so it’s much more of a microeconomic viewpoint. By adding explanations to the solution, a model like this can empower decision makers to get accurate information on the bottom-line impact of many decisions. That said, this model is going to be extremely vulnerable to human bias, so while it can be an amazing strategic solution, it isn’t a great pure forecasting one.

And there’s no law that says all features have to be macro- or microeconomic. In fact, many are mashups, if you will. So ultimately the key isn’t to match the features to what you’re predicting; it’s to match the features to the question you’re trying to answer.

Perspective 3: Into the Wild

JS Irick: As you know, many wonderful forecasting tools never make it out of the “validation” phase. Integration, maintenance, retraining, and continuous validation are critical for the long term health of any project, but especially an ML/AI project. Unsupervised predictive models tend to fail “silently”, in that there’s no production down moment. Our product Florence is one way for customers to ensure not only the best practices in model development but also long term model health. Can you talk a little bit about the challenges customers face and how Florence solves them?

Daniel Settanni: Glad to. ML/AI projects often focus, in some cases solely, on building an accurate model. This is a fine approach if you’re in a scientific setting, but in the Enterprise simply building the model is only a small piece of the puzzle.

To get the most value out of an Enterprise ML project, it has to be:

Accurate

Interactable

Explainable

A model alone can only deliver on accuracy.

To be interactable, the model has to be accessible in real time. This means it has to be deployed somewhere and either integrated into an existing system or exposed through a net-new application.

To be explainable, the appropriate technology must be deployed alongside the model and integrated into the prediction process.

The challenges that come with making a model interactable and explainable are considerable and often require ongoing collaboration with the DevOps and Development teams. I highlighted “ongoing collaboration” because this is a commonly missed cost/risk. During the lifetime of an ML/AI project, its model(s) will likely have to be retrained many times. When a model gets retrained, the data preparation steps often have to be updated, and when that happens corresponding changes have to be made by the DevOps and Development teams. The worst part is that if the changes aren’t made exactly right, the models will keep on delivering predictions. They’ll just be less accurate, probably way less accurate. And if you’re making decisions off of those predictions, that could be very costly.

Most solutions only deliver on a few pieces of an ML/AI project, leaving it up to each customer to figure out everything else. We took a very different approach with Florence.

Florence covers the entire process, from creating accurate and explainable models, to making them available in real time, to providing the APIs and security necessary to integrate with practically any Enterprise system.

One of my favorite technological advances is the way Florence abstracts away things like data preparation, so Developers can focus solely on creating the best user experience, and users can be confident that predictions aren’t wrong due to integration issues.

JS Irick: Excellently put. I’m a big believer in Eric Raymond’s “The Art of Unix Programming”. I find that the rules still hold up (also, it’s interesting that some of the proto-ML techniques are coming back into vogue). Some of the rules speak strongly to the strengths of ML – “Avoid hand-hacking; write programs to write programs when you can” and “Programmer time is expensive; conserve it in preference to machine time” come immediately to mind. However, you’ve touched on something that shoots up red flags – “When you must fail, fail noisily and as soon as possible”. Some of the toughest technical issues we face come when a system is failing silently, producing unintended consequences downstream. Especially when it comes to algorithms whose results are making financial decisions. Ask any futures trader, they’d rather the system crash than give incorrect responses due to a bug.

You hit the nail on the head when you noted that Florence applies the necessary data prep on the decision side as well as the training side. If numbers need to be scaled, normalized, etc. that should absolutely be on the server side. As a user, I get so salty when I hear things like “Oh, you forgot to turn your image into an array of integers before submitting it”. Let people speak in their language, and if there’s any data prep that needs to be done, it needs to be done in an abstracted, centralized way.

You’ve been doing some tremendous UX work in Florence recently, got a teaser for us?

Daniel Settanni: I’ve got the perfect images for this conversation. They’re of Florence’s model validation view for a Macroeconomic model.

The first screenshot shows the incredible accuracy we obtained but, perhaps more importantly, also the explainability Florence delivers. Information like this can drive extremely valuable insights, and with Florence it doesn’t come with any additional work – it’s baked right in.

Thank you so much for spending time with me today Daniel. I always learn a tremendous amount when we speak, and even better, I get fired up to build new things. Hopefully, our readers were both educated and inspired as well.

Daniel and I consistently share articles/podcasts/news on AI/ML topics, and we’d love it if you all joined the conversation. Be on the lookout for our upcoming weekly newsletter which will go over the most interesting content of the week.

About our contributors:

Daniel Settanni

Senior Data Scientist and Cloud Architect

Daniel Settanni is living the dream at TruQua, using innovative technology to solve traditionally underserved Enterprise challenges. Making advanced technology more accessible with thoughtful design has been Daniel’s passion for years. He’s currently focused on Florence by TruQua, an innovative Machine Learning solution that delivers Interactive and Explainable AI in a fraction of the time, and cost, of any other product on the market.

Thank you for reading this installment in our New Perspectives blog series. Stay tuned for our next post where we will be talking to Senior Consultant Matt Montes on Central Finance and the road to S/4HANA for mature organizations.

For more information on TruQua’s services and offerings, visit us online at www.truqua.com. To be notified of future blog posts and be included on our email list, please complete the form below.

Daniel Settanni is a Senior Cloud Development Architect at TruQua Enterprises.

Artificial Intelligence (AI), Machine Learning (ML), Predictive Analytics, Blockchain – with so many emerging technologies (and the associated buzzwords), it can be a challenge to understand how they can fit into your business. Here’s a short primer to help anyone new to the topic make sense of Machine Learning in the Enterprise.

What is Machine Learning?

There are quite a few technical definitions of machine learning, but they all boil down to the same concept: Machine Learning is a technique that uses advanced math to learn how different pieces of data are related to each other.

Three of the best ways to use ML in the business world are:

Regression models learn how to make numeric predictions

Classification models learn how to classify samples into different known groups

Clustering models learn how to group samples all by themselves, i.e., without a predefined set of sample groups.

Let’s run through an example of when each could be used.

Predicting Sales Volume, a Regression example

Almost any organization that sells something goes through a process where they forecast future sales. This process usually entails analyzing some mix of historical performance, external influences (market conditions, etc.), and internal strategy (pricing changes, etc.).

The forecasts are often performed by different groups, like Sales and Operations, and then passed to the Management team for review.

So why might we want to use Machine Learning to forecast sales instead of staying with the existing procedure? One reason is to minimize bias. In this case, the bias we’re trying to remove is a human one. Forecasts have consequences that impact the very people creating the forecasts, so it’s only natural for bias to creep in.

By creating a Machine Learning model, we could remove the human bias and introduce additional value-adding features like explainability (why the Machine Learning model is predicting a number) and what-if analysis (what would happen to our sales forecast if one or more inputs were to change).
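A regression forecast can be as simple as a least-squares line through historical sales. The monthly figures below are made up for illustration:

```python
# Toy regression example: fit a least-squares line to past monthly sales
# and use it to forecast the next month. Figures are illustrative only.

months = [1, 2, 3, 4, 5, 6]
sales  = [100, 110, 125, 130, 145, 155]  # units sold

n = len(months)
mean_x, mean_y = sum(months) / n, sum(sales) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x

def forecast(month):
    """Predict sales for a given month from the fitted line."""
    return intercept + slope * month

print(f"month 7 forecast: {forecast(7):.0f} units")
```

A real model would add the external and internal drivers described above as extra features, but the learning principle – fitting parameters to history – is the same.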

Detecting Fraud, a Classification Example

We live in a period of rapidly evolving, and expanding, Financial solutions targeted directly to consumers. These solutions are collectively known as the FinTech industry, and although each company’s offerings are different, they all share a common problem… Fraud. And this problem is by no means a new one. Fraudulent transactions originating within a company can be an even bigger problem than those from outside actors.

So, what’s the best way to start solving the fraud problem?

We could hire a huge team of analysts to comb through every transaction looking for tell-tale signs of fraud. This solution is doomed to fail. It would be expensive, have varying effectiveness (depending on the analyst), slow down transaction throughput, and, maybe worst of all, do nothing about fraud originating within a company.

Another idea would be to have a smaller team analyze past fraudulent transactions, creating tests that could be used to programmatically check for fraud. This is certainly a better option, but given transaction volume (and the risk of internal fraud), it also has a number of drawbacks.

Machine Learning could go much further in solving this problem. By creating a model based on historical Fraudulent and non-Fraudulent transaction details, an organization could benefit from a mathematically sound analysis of much larger sets of data. And, since creating a model is much less labor-intensive, it could be recreated (retrained, in ML speak) to more rapidly identify new forms of fraud.
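As a minimal illustration of classification, a nearest-centroid model learns one “center” per known class from labeled history and assigns new transactions to the closer one. The features (amount, hour of day) and figures below are invented:

```python
# Toy classification example: nearest-centroid classifier trained on labeled
# historical transactions. Features and figures are illustrative placeholders.

fraud = [(9500, 3), (8700, 2), (9900, 4)]   # large amounts, middle of the night
legit = [(45, 12), (120, 14), (60, 18)]     # small amounts, daytime

def centroid(points):
    """Mean point of a group of samples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def classify(txn, centroids):
    """Assign a transaction to the class with the nearest centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda label: dist(txn, centroids[label]))

centroids = {"fraud": centroid(fraud), "legit": centroid(legit)}
print(classify((9200, 1), centroids))
print(classify((80, 13), centroids))
```

Retraining here is just recomputing the centroids on fresh labeled data – which hints at why ML-based fraud checks can adapt faster than hand-written rules.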

Customer Segmentation Analysis, a Clustering Example

In this age of technology, one thing organizations aren’t at a loss for is data. They have so much data that it can be difficult, if not impossible, to analyze it all in any traditional way. This can lead to huge potential losses in both strategic and monetary terms.

Clustering models are amazing in their ability to identify relationships within huge amounts of data that could otherwise go undetected. One concrete example of Clustering is Customer Segmentation Analysis. Here the model could group customers based on any number of patterns, such as purchasing behavior, product mix, and timing. These insights ultimately lead to a better understanding of any organization’s most profitable asset… their customers.
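As a minimal illustration of clustering, a one-dimensional k-means can segment customers by annual spend with no predefined labels at all; the spend figures are invented:

```python
# Toy clustering example: a few iterations of 1-D k-means grouping customers
# by annual spend, with no predefined groups. Figures are illustrative only.

spend = [120, 150, 130, 5000, 5200, 4900, 900, 1000]
centers = [min(spend), max(spend)]  # start with 2 clusters

for _ in range(10):  # a handful of k-means iterations
    groups = {c: [] for c in centers}
    for s in spend:  # assignment step: nearest center
        nearest = min(centers, key=lambda c: abs(s - c))
        groups[nearest].append(s)
    # update step: move each center to its group's mean
    centers = sorted(sum(g) / len(g) for g in groups.values() if g)

print(centers)  # a low-spend segment and a high-spend segment emerge
```

The model discovers the low- and high-spend segments on its own – exactly the kind of structure that is invisible in a raw transaction dump.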

Machine Learning Implementation Process

Now that you have a better understanding of what types of problems Machine Learning can solve, you might be wondering what it takes to implement. The basic process includes the following steps:

Understanding the problem that needs to be solved

Analyzing and preparing the data

Creating an accurate model

Integrating that model with existing systems and processes

Another key question you’ll need to ask is, who can do all of this? In some cases, a software vendor can deliver Machine Learning capabilities out-of-the-box. This works best when a problem is well defined and common within a specific business or industry process.

For example, SAP’s Cash Management application can harness the full benefits of machine learning because Cash Management challenges are so similar across organizations.

But what if an out-of-the-box solution doesn’t exist? This is where you’ll need to go a step further and employ the skills of a data scientist and an area where TruQua can help.

Conclusion

Machine Learning provides incredible possibilities for an organization to not only automate Finance Processes (through categorization and regression) but also to discover new insights from their data (through explainable AI and segmentation).

A successful ML implementation requires a cross-functional team that takes a critical look at everything from data selection and preparation to the way the model’s insights will be used. As such, it’s critical to not only pick the right problem but the right partner.

How TruQua can help

TruQua’s team of consultants and data scientists merge theory with practice to help our customers gain deeper insights and better visibility into their data. Whether you need scenario identification, Machine Learning model development, or integration with Finance systems, TruQua has the tools needed for success. So, if you’re looking to make more informed business decisions utilizing the latest predictive analytic and Machine Learning capabilities from SAP, we’d love to hear from you.

CHICAGO, IL, APRIL 03, 2018— TruQua Enterprises, LLC, a leading SAP software and services firm, today announced the release of Florence, the first Machine Learning platform built to integrate with SAP systems. Deployed on the cloud, Florence enables the simple and secure deployment of Machine Learning algorithms, with secure connectivity to SAP business systems.

As Machine Learning capabilities evolve, so does the need to quickly integrate Data Science, Finance, and Logistics groups with minimal disruption to existing business processes. That need is how the idea for Florence originated. TruQua developed Florence to simplify the use of Machine Learning models within SAP. Utilizing the Florence platform, SAP technologies such as S/4HANA, SAP HANA, Cloud Applications, and Business Intelligence applications can now integrate seamlessly with Machine Learning predictions, without the need for additional infrastructure or software licenses.

With the announcement of Florence, TruQua has created three targeted offerings designed to help customers on their Machine Learning journey. These offerings include:

First Steps with Florence – For customers who are just getting started with Machine Learning. A six-week engagement where customers work with Data Scientists to define the use case, build an initial model, and integrate it with their business systems.

Targeted Business Need – A five-week engagement to create an initial model and integrate it with existing business systems.

Bridging Data Science and Finance – A two-week engagement to integrate Machine Learning models into a customer’s existing business systems.

About TruQua Enterprises

TruQua Enterprises is an IT services, consulting, and licensed SAP development partner that specializes in providing “True Quality” SAP solutions to Fortune 500 companies with integrated, end-to-end analytic solutions. Through project management, software innovation, thought leadership, implementation and deployment strategies, TruQua’s team delivers high value services through its proprietary knowledge base of software add-ons, development libraries, best practices, solution research and blueprint designs. TruQua has also been certified as a Great Place to Work and ranked #11 by Fortune Magazine’s “The 50 Best Companies to Work for in Chicago” in the Small and Medium Sized Companies category. For more information, please visit www.TruQua.com or follow us on Twitter @TruQuaE.

In part one of “Key Business Factors in Machine Learning” (https://www.truqua.com/key-business-factors-machine-learning-part-1-predicting-employee-turnover/), we explored how Machine Learning can categorize data. We also reviewed the business’s role in model development. In this blog, we will look at creating Machine Learning algorithms to predict values. In particular, we will be looking at Sales Demand for Bicycle rentals. Divvy Bikes is Chicago’s own bike sharing system and, with over 10 million unique rides taken a year, the largest in North America.

The dataset used in this article combines all of Divvy Bike’s 10+ million rides from 2015 with hourly weather data and the Chicago Cubs schedule to observe the effect of external factors on rider traffic, offering a look at adapting Machine Learning models to discrete locations.

In this example, we are going to test three popular Machine Learning techniques – Logistic Regression, Support Vector Machines, and Random Forests – to predict the number of bikes that will be in service at a given Divvy station for a given hour.

Refining the dataset

The Divvy Bike/Weather/Cubs dataset in this article is much more complicated than the Employee Attrition dataset in part 1, featuring over 55 different factors.

Two of the factors in the model can be generalized into groups to help train the model more efficiently. These factors express time as integers: day of the year and day of the week. Expressed as plain integers, their cyclical meaning is obscured from the model.

Certain algorithmic techniques can work around this obfuscation, but it can be much more efficient to perform an initial grouping to accelerate the model development.

Day of the week and day of the year have obvious groupings; however, your business data may have groupings that are not immediately obvious to the data scientists.

Here we see the impact of creating a “Season” category for day of the year:

Similarly, we can see that there is a large effect on demand based on the day of the week. Weekdays have a huge 5PM spike that is not seen on weekends. Therefore, we can greatly increase the initial accuracy of our models by changing day of the week into a Weekday/Weekend grouping.
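A grouping like this is straightforward to build with pandas. The column names below are illustrative stand-ins for the Divvy schema, not the actual field names:

```python
import pandas as pd

# Hypothetical ride timestamps; the column name is an illustrative stand-in
rides = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2015-07-03 17:00",  # Friday
        "2015-07-04 12:00",  # Saturday
        "2015-07-06 08:00",  # Monday
    ])
})

# Collapse day-of-week (0-6, Monday=0) into a Weekday/Weekend grouping
rides["day_type"] = rides["start_time"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

# Collapse day-of-year into a Season grouping via the month
season = {12: "winter", 1: "winter", 2: "winter",
          3: "spring", 4: "spring", 5: "spring",
          6: "summer", 7: "summer", 8: "summer",
          9: "fall", 10: "fall", 11: "fall"}
rides["season"] = rides["start_time"].dt.month.map(season)
print(rides[["day_type", "season"]])
```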

Investigating the Data

Removing outliers can be a critical step in improving model fit. However, it is important to define just what an outlier is in the context of your business process. For example, removing outliers from our bike ride dataset can help refine our model: the Fourth of July causes a demand spike that is obvious to anyone familiar with the US, while the spike for the “Air and Water Show” would not be apparent to non-Chicagoans. We don’t want our model to try to fit a factor that is not represented in our data.
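One simple way to handle such event-driven spikes is to filter the known dates out before training. The dates and ride counts below are illustrative, not figures from the actual dataset:

```python
import pandas as pd

# Hypothetical daily ride totals, including two known one-off events
daily = pd.DataFrame({
    "date": pd.to_datetime(["2015-07-04", "2015-07-05",
                            "2015-08-15", "2015-08-16"]),
    "rides": [9500, 3100, 8800, 3000],
})

# Days driven by external events we do not want the model to fit
known_events = pd.to_datetime(["2015-07-04", "2015-08-15"])  # July 4th, Air and Water Show

filtered = daily[~daily["date"].isin(known_events)]
print(len(filtered))
```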

Compare this, however, to fraud detection algorithms that exist solely to detect outliers. If we were to remove the outliers from that model, we would end up with a model that was 100% accurate, since all the fraudulent transactions would have been removed from the dataset.

Modeling with the Dataset

Once the dataset has been prepared, it is time to develop, train and test with different Machine Learning algorithms.

In this example, we looked at three very different algorithms – Logistic Regression, Support Vector Machines, and Random Forests.

Logistic Regression

Logistic Regression is a statistical method for analyzing a dataset where there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (MEDCALC).

Support Vector Machines

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other (Wikipedia).

Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set (Wikipedia).
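The three algorithms above can be compared side by side with scikit-learn. This sketch uses synthetic data in place of the prepared Divvy dataset, so the scores it prints are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the prepared dataset (the real features came from
# the Divvy/weather/Cubs data)
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Fit each algorithm on the same training split and score on held-out data
scores = {}
for name, model in [
    ("logistic_regression", LogisticRegression()),
    ("svm", SVC()),
    ("random_forest", RandomForestClassifier(random_state=2)),
]:
    scores[name] = model.fit(X_train, y_train).score(X_test, y_test)
print(scores)
```

Which model wins depends heavily on the data, which is why the article stresses testing several candidates rather than committing to one up front.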

Modeling Results

As can be seen, Logistic Regression was the clear winner for this scenario, but the optimal model typically isn’t obvious at the start, and may even be a surprise at the end. Further accuracy gains required separating out the data by departure station, as there are significant model differences between each station (for example, downtown station usage is very resistant to changes in weather and is almost entirely dependent on the weekday/weekend grouping).

For more information on how you can put Machine Learning to work at your own organization, contact us today at info@truqua.com. Our team of consultants and data scientists is on hand and ready to assist. For companies with robust data science organizations, we offer several project accelerators to easily and securely combine your business data with your Data Scientists’ Machine Learning algorithms.

As industry leaders ramp up their investments in Machine Learning, there is a growing need to communicate effectively with Data Scientists. Without a true understanding of both the technology and business factors involved in the Machine Learning scenario, it is impossible to create long term solutions.

In Part 1 of this 2-part blog series, we will work through the first of two Machine Learning examples and describe the communication and collaboration necessary to successfully leverage Machine Learning for business scenarios.

Machine Learning algorithms are very good at predicting outcomes for many different types of scenarios by analyzing existing data and learning how it relates to the known outcomes (what you’re trying to predict). Two of the most common types of machine learning algorithms are classification and regression.

With classification, the predicted values are fixed, meaning there are a limited number of outcomes, such as determining whether a customer will make a purchase. Regressions, on the other hand, make continuous numerical predictions, such as estimating the lifetime value of a customer. In each case, it is critical that the Data Scientist understands both the inputs (the source of the individual factors and how they are created) and the business event you are trying to categorize or predict.
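The distinction shows up directly in code. A brief sketch with scikit-learn on synthetic data (the purchase and lifetime-value targets here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))

# Classification: a fixed set of outcomes (will the customer buy: 0 or 1)
y_class = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
class_preds = clf.predict(X)  # every prediction is one of the known labels

# Regression: a continuous prediction (e.g. lifetime value in dollars)
y_value = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(scale=5, size=100)
reg = LinearRegression().fit(X, y_value)
print(reg.predict(X[:1]))  # a real-valued estimate, not a category
```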

Categorizing Example: Employee Turnover

Understanding Machine Learning and business goals

First, let’s look at an example that demonstrates how to use Machine Learning to perform categorization. In this case, we are trying to better predict Employee Turnover. So, the goal of the machine learning algorithm is to categorize current employees as “Likely to Leave” or “Unlikely to Leave”. The categorization will be based on factors we have about each employee.

Our goal, however, is more than raw categorization accuracy. The business requirement is to identify the employees likely to leave so that actions can be taken to retain them. Before we continue, it is important to understand the cost of both a false positive and a false negative with regards to your business.

False Positive: An employee that is not going to leave is flagged as likely to leave.

False Negative: An employee leaves despite no indication from the machine learning algorithm.

In this case, False Negatives are costlier than False Positives. The algorithm with the best fit (overall performance) may not be the most effective for your business if it does not appropriately weigh the cost of the outcomes.
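One way to encode this asymmetry is to weight the “leave” class during training and to score candidate models by business cost rather than raw accuracy. The class weights and cost figures below are illustrative assumptions, not values from the original analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Illustrative attrition data: 1 = employee left
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1).astype(int)

# Weighting the "leave" class makes the model penalize missed departures
# (false negatives) more heavily during training
model = LogisticRegression(class_weight={0: 1, 1: 5}).fit(X, y)

# Score by business cost, not accuracy: a missed departure costs far more
# than a false alarm (the 1:10 ratio is an assumption for illustration)
tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
cost = 1 * fp + 10 * fn
print(cost)
```

Comparing candidate models by this cost, rather than by overall fit, keeps the evaluation aligned with the retention goal described above.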

Communicating the available data with the Data Scientists

Machine learning algorithms need to be developed and trained on historical data, so for each historical employee we have features that we believe are related to attrition, as well as the outcome itself: whether they remained at the company.

When undertaking a Machine Learning project, it’s critical to work with a partner who will take the time to understand the various features that can be used within the model. If the data scientist does not understand the inputs into the model, you are likely to end up with models that perform well in testing but poorly in production, a problem known as “overfitting”.

This communication with the Data Scientist can also lead to the inclusion of additional valuable external data that were initially missing from the model.

Let’s look at the factors in the Employee Turnover dataset.

There are three important items to note here:

1. Satisfaction level is self-reported and people are notoriously poor self-reporters.
2. The job role column is labeled “sales” in the input dataset. While descriptive column names are nice, they are no replacement for a good data dictionary.
3. Salary is a simple “High/Medium/Low” value, but is not normalized for job role.

Refining the dataset

Once we have reviewed the factors, as well as the business event we are trying to model, we need to better understand how they relate to each other. The relationships between factors and results, as well as between individual factors, should be analyzed. Here we see a chart describing the correlations between our various factors, and whether the employee stayed with the company.

When looking at the relationships, we start to understand the correlations between our data. This step should reveal a number of data relationships which make intuitive sense, and may show some surprising results.

1. Number of current projects and number of hours worked are related. [Intuitive]
2. Employees with a longer tenure are less likely to leave. [Intuitive]
3. There is a slight negative relationship between satisfaction and retention. [Surprising]

When looking at the relationships between data, we can also find highly correlated associations. This can help determine factors to either combine or remove.
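A correlation check like this is a one-liner in pandas. The miniature dataset below is a hypothetical stand-in for the real employee data, with illustrative column names:

```python
import pandas as pd

# Hypothetical stand-in for the employee dataset
df = pd.DataFrame({
    "satisfaction":  [0.9, 0.4, 0.7, 0.2, 0.8, 0.3],
    "monthly_hours": [160, 280, 170, 130, 165, 290],
    "num_projects":  [3, 6, 4, 2, 3, 7],
    "left_company":  [0, 1, 0, 1, 0, 1],
})

# Pairwise correlations between every factor and the outcome
corr = df.corr()
print(corr["left_company"].round(2))
```

Highly correlated factor pairs (here, monthly hours and project count) are the candidates for combining or removing.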

Additionally, it is necessary to look at the numerical data to determine if we should change certain values to ranges/buckets. For example, look at the relationship between monthly hours and employee retention.

Note the monthly hours for employees that were not retained. This should make intuitive sense, as the only thing worse than working too much is working too little. Rather than use monthly hours as a raw value, our model would be better served by defining categories for monthly hours.
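With pandas, bucketing a numerical factor into ranges might look like this; the bucket edges and labels are illustrative choices, not values from the original analysis:

```python
import pandas as pd

# Replace raw monthly hours with ranges the model can treat as categories
hours = pd.Series([120, 150, 185, 240, 300])
buckets = pd.cut(
    hours,
    bins=[0, 140, 220, 400],                          # illustrative edges
    labels=["under_worked", "typical", "over_worked"],
)
print(list(buckets))
```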

Model Development

Once the data set has been analyzed, model development can begin. This is generally an iterative process, going through a number of different model types, as well as re-examining the initial data set.

While this iterative process is being performed, it is important to look at the output of the models, not just their fit. This is where the definition of your business goal and communication with an experienced Data Scientist are critical. For example, a fraud detection algorithm that never detects fraud is over 99% accurate. Fit is not enough.

For our employee retention example, we tested three popular machine learning algorithms. Below you can see the fit of each of the three models and, more importantly, the output for a subset of the testing data.

We have taken an abbreviated look at how a data scientist might approach this scenario, but in the real world this is only a part of the solution. There are still questions surrounding how the model is served, how it is consumed within the business process and how a strategy is devised in order to retrain the model with updated data.

If you have questions, we have the answers. TruQua’s team of consultants and data scientists merge theory and practice to help customers gain deeper insights into their data for more informed decision making. For more information or to schedule a complimentary workshop that identifies what Machine Learning scenarios make sense for your business, contact us today at info@truqua.com.