3 Strategies
1. Know what you are doing
2. Have a plan for the project
3. Assess models the way you want to use them
4. Do for the algorithms what they cannot do for themselves
5. Deploy models wisely

4 Strategy 1: Know What You Are Doing
- What is Predictive Analytics?
- How does PA differ from:
  - Statistics
  - BI
  - Big Data

5 What is Predictive Analytics? Wikipedia Definitions
- Predictive analytics is an area of statistical analysis that deals with extracting information from data and uses it to predict future trends and behavior patterns.
- The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting it to predict future outcomes.
Abbott Analytics, Inc

6 What is Predictive Analytics? Other Definitions (in the news and blogs)
- Predictive Analytics is emerging as a game-changer. Instead of looking backward to analyze "what happened?" predictive analytics help executives answer "What's next?" and "What should we do about it?" (Forbes Magazine, April 1, 2010)
- Predictive analytics is the branch of data mining concerned with the prediction of future probabilities and trends. (searchcrm.com)
- Predictive Analytics *is* data mining re-badged because too many people were claiming to do data mining and weren't. (Tim Manns paraphrasing Wayne Eckerson of TDWI)

24 Gains Charts and Lift Curves
[Figure: gains chart and lift chart, each plotted against a random-selection baseline]
- Respondents are rank-ordered (sorted) by predicted values
- The X axis is the percentage of records as you go down the file
- Gain is the percentage of target=1 records found at the indicated file depth
- Lift ratio is how many times more respondents are found at a given customer depth compared to random selection
- Random gain has slope equal to the proportion of respondents in the training data
- Random lift is
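The gain and lift definitions on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the function name `gains_and_lift`, the toy scores, and the chosen depths are all assumptions made for the example.

```python
def gains_and_lift(scores, targets, depths=(0.1, 0.2, 0.5, 1.0)):
    """Return (depth, gain, lift) tuples for a binary target (1 = respondent)."""
    # Rank-order the records by predicted value, highest score first.
    ranked = [t for _, t in sorted(zip(scores, targets),
                                   key=lambda pair: pair[0], reverse=True)]
    total_respondents = sum(ranked)
    rows = []
    for depth in depths:
        n = max(1, round(depth * len(ranked)))   # records at this file depth
        found = sum(ranked[:n])                  # respondents found so far
        gain = found / total_respondents         # share of all respondents found
        lift = gain / (n / len(ranked))          # ratio to random selection
        rows.append((depth, gain, lift))
    return rows


# Hypothetical model scores and actual outcomes for ten records.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
targets = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

rows = gains_and_lift(scores, targets, depths=(0.2, 0.5, 1.0))
for depth, gain, lift in rows:
    print(f"depth={depth:.0%}  gain={gain:.2f}  lift={lift:.2f}")
```

In this toy data the top 20% of the file holds two of the three respondents, so the gain there is about 0.67 and the lift about 3.33.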

32 How Sampling Affects Accuracy Measures
- For example: 95% non-responders (N), 5% responders (R)
- What's the problem? (The justification for resampling)
  - Sample is biased toward non-responders
  - Models will learn non-responders better
  - Most algorithms will generate models that say "call everything a non-responder" and get 95% correct classification! (I used to say this too)
- Most common solution: stratify the sample to get 50%/50% (some will argue that one only needs 20-30% responders)
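The arithmetic behind this slide can be checked with a small sketch. It assumes a hypothetical file of 1,000 records with a 5% response rate; the helper names `majority_baseline_accuracy` and `stratify_50_50` are illustrative and not from the presentation, and downsampling the majority class is only one way to reach a 50%/50% mix.

```python
import random


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / len(labels)


def stratify_50_50(records, label_of, seed=0):
    """Downsample the larger class so both classes are equally represented."""
    rng = random.Random(seed)
    responders = [r for r in records if label_of(r) == 1]
    non_responders = [r for r in records if label_of(r) == 0]
    n = min(len(responders), len(non_responders))
    sample = rng.sample(responders, n) + rng.sample(non_responders, n)
    rng.shuffle(sample)
    return sample


# Hypothetical data: 50 responders (1) and 950 non-responders (0).
records = [(i, 1 if i < 50 else 0) for i in range(1000)]
labels = [y for _, y in records]

baseline = majority_baseline_accuracy(labels)
balanced = stratify_50_50(records, label_of=lambda r: r[1])
balanced_baseline = majority_baseline_accuracy([y for _, y in balanced])

print(f"all-N accuracy on the original file: {baseline:.0%}")
print(f"all-N accuracy on the 50/50 sample:  {balanced_baseline:.0%}")
```

On the balanced sample the "call everything a non-responder" model drops to 50% accuracy, which is why percent correct classification depends so heavily on how the data was sampled.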

38 Why Data Science is Not Enough: Netflix Prize
netflix-recommendations-beyond-5-stars.html
There's more to a solution than accuracy: you have to be able to use it!

39 Strategy 4: Do for algorithms what they can't do for themselves
- Get the data right
- Understand how algorithms can be fooled with correct data
  - Outliers
  - Missing Values
  - Skew
  - High Cardinality

40 Clean Data: Outliers
- Are the outliers problems?
  - Some algorithms: yes
    - Linear regression, nearest neighbor, nearest mean, principal component analysis
    - In other words, algorithms that need mean values and standard deviations
  - Some algorithms: no
    - Decision trees, neural networks
- If outliers are problems for the algorithm:
  - Are they key data points? Do not remove these
    - Consider taming outliers with transformations (features)
  - Are they anomalies or otherwise uninteresting to the analysis? Remove them from the data so that they don't bias models
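As a rough illustration of why mean-based algorithms are sensitive to outliers, and of how a transformation can tame them, the sketch below uses a made-up income list with one extreme value. The data and the choice of a log10 transform are assumptions for the example; the slide does not name a specific transformation.

```python
import math
import statistics

# Hypothetical incomes; the last value is an outlier but a real data point.
incomes = [32_000, 41_000, 45_000, 52_000, 58_000, 2_500_000]

mean_raw = statistics.mean(incomes)
median_raw = statistics.median(incomes)

# Taming the outlier with a log transform instead of deleting the record.
log_incomes = [math.log10(x) for x in incomes]
mean_log = statistics.mean(log_incomes)
median_log = statistics.median(log_incomes)

print(f"raw:   mean/median = {mean_raw / median_raw:.2f}")
print(f"log10: mean/median = {mean_log / median_log:.2f}")
```

On the raw scale the single outlier pulls the mean to more than nine times the median; after the log transform the mean and median are within about 6% of each other, so the record stays in the data without dominating mean-based calculations.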

41 To KNIME: KNIME outlier

42 Clean Data: Missing Values
- Missing data can appear as blank, NULL, NA, or a code such as 0, 99, 999, or -1.
- Fixing missing data:
  - Delete the record (row), or delete the field (column)
  - Replace the missing value with the mean, the median, or a value drawn from the distribution
  - Replace the missing value with an estimate
    - Select a value from another field having high correlation with the variable containing missing values
    - Build a model with the variable containing missing values as the output and other variables without missing values as inputs
- Other considerations:
  - Create a new binary variable (1/0) indicating missing values
  - Know what algorithms and software do by default with missing values
    - Some do listwise deletion, some recode with 0, some recode with midpoints or means
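Two of the options above, replacing missing values with the median and creating a 1/0 missing-value indicator, might be combined as in the following sketch. The set of codes treated as missing and the function name `impute_with_flag` are assumptions for illustration; in practice the codes depend on the data source, and a legitimate value such as 99 must not be swept up by mistake.

```python
import statistics

# Assumed missing-value codes for this example only.
MISSING_CODES = {None, "", "NA", "NULL", -1, 999}


def impute_with_flag(values, missing_codes=MISSING_CODES):
    """Return (imputed_values, missing_flags) using median imputation."""
    is_missing = [v in missing_codes for v in values]
    observed = [v for v, m in zip(values, is_missing) if not m]
    fill = statistics.median(observed)            # computed from observed values only
    imputed = [fill if m else v for v, m in zip(values, is_missing)]
    flags = [1 if m else 0 for m in is_missing]   # new binary indicator variable
    return imputed, flags


# Hypothetical age field with three different encodings of "missing".
ages = [34, None, 51, 999, 27, "NA", 45]
imputed_ages, age_missing = impute_with_flag(ages)

print(imputed_ages)
print(age_missing)
```

Keeping the indicator alongside the imputed field lets a model learn whether the fact that a value was missing is itself predictive.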

48 Transforms: Scaling Data
[Figure: Income and Age by Person Number; before normalization, the income scale dwarfs age]
- z-score: x* = (x - mean) / std
[Figure: Norm_Age and Norm_Income (z-scores) by Person Number; income and age on the same scale]
- Scale to range [0,1]: x* = (x - x_min) / (x_max - x_min)
- Both allow one to see both variables on the same scale
- Can apply this to subsamples of data (regional data, for example)
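The two scaling formulas on this slide can be written out directly. The sketch below uses made-up age and income values; the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the slide does not say whether the population or sample form is meant.

```python
import statistics


def zscore(xs):
    """x* = (x - mean) / std, using the population standard deviation."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def minmax(xs):
    """x* = (x - x_min) / (x_max - x_min), which scales to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


# Hypothetical values: income is three orders of magnitude larger than age.
ages = [25, 35, 45, 55, 65]
incomes = [30_000, 45_000, 60_000, 75_000, 90_000]

print([round(z, 3) for z in zscore(ages)])
print([round(z, 3) for z in zscore(incomes)])
print(minmax(ages))
print(minmax(incomes))
```

Because both toy fields are evenly spaced, they map to identical scaled values, which makes it easy to see that either transform puts age and income on the same scale. To scale within subsamples (by region, for example), the same functions can be applied to each group separately.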

57 Form of Models for Deployment: In-PA Software Deployment
- Run models through original software in ad hoc or automated process
- Benefits:
  - Data prep done in software is still there
    - But still may have to trim down processing for efficiency
  - No further work to be done to deploy
- Drawbacks:
  - Usually slower: have to pull data out and push it back to the database
  - Software not usually optimized for speed; optimized for usability
  - Requires a software expert to maintain and troubleshoot
    - Analyst usually involved
  - Errors not always handled gracefully

58 Form of Models for Deployment: External Call to PA Software
- Run models through original software in ad hoc or automated process, but as a call from the OS
- Benefits:
  - Data prep done in software is still there
    - But still may have to trim down processing for efficiency
  - No further work to be done to deploy
- Drawbacks:
  - Usually slower: have to pull data out and push it back to the database
  - Software not usually optimized for speed; optimized for usability
  - Requires a software expert to maintain and troubleshoot
    - Analyst usually involved
  - Errors not always handled gracefully

59 Form of Models for Deployment: Translation to Another Language
- Translate models into SQL, C (++, #, etc.), Java, PMML
  - If in C/Java, can create a standalone application just for the model scoring
- Benefits:
  - Get models out of the software environment, where they can be run and maintained by others
  - Often run more efficiently in the database or other environment
  - Many tools provide export capabilities into other languages
- Drawbacks:
  - Translation of data prep is not usually included in tool export; it requires significant time and QC/QA to ensure consistency with the tool
  - Bug fixes take longer
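To make the translation and QC/QA steps concrete, here is a minimal sketch that turns a small decision tree into a SQL `CASE` expression and checks that the SQL scores match the original scores. Everything in it is hypothetical: the tree, the field names, the `customers` table, and the helper functions are invented for illustration, and an in-memory SQLite database stands in for the production database.

```python
import sqlite3

# Hypothetical tree as nested tuples:
# (field, threshold, subtree_if_less_or_equal, subtree_if_greater); leaves are scores.
TREE = ("age", 40, ("income", 50000, 0.10, 0.35), 0.60)


def score_python(tree, row):
    """Score one record by walking the tree (stands in for the modeling tool)."""
    while isinstance(tree, tuple):
        field, threshold, left, right = tree
        tree = left if row[field] <= threshold else right
    return tree


def tree_to_sql(tree):
    """Translate the tree into a nested SQL CASE expression."""
    if not isinstance(tree, tuple):
        return repr(tree)
    field, threshold, left, right = tree
    return (f"CASE WHEN {field} <= {threshold} "
            f"THEN {tree_to_sql(left)} ELSE {tree_to_sql(right)} END")


rows = [
    {"age": 30, "income": 40000},
    {"age": 30, "income": 80000},
    {"age": 55, "income": 40000},
]

# Load the same records into the database and score them with the generated SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, income INTEGER)")
conn.executemany("INSERT INTO customers VALUES (:age, :income)", rows)
sql = f"SELECT {tree_to_sql(TREE)} FROM customers ORDER BY rowid"
sql_scores = [r[0] for r in conn.execute(sql)]

# QC/QA step: the translated model must reproduce the original scores.
python_scores = [score_python(TREE, r) for r in rows]
assert sql_scores == python_scores
print(sql)
```

The comparison at the end is the part that tends to take the time in practice: any data preparation done before scoring has to be translated and verified in the same way.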

60 Form of Models for Deployment: PMML
- Translate models into PMML
  - Different than SQL, C, Java, etc.
- Benefits:
  - PMML supports (natively) the entire predictive modeling process
  - Language is simple
  - Database support
  - Online support for scalable scoring (Zementis)
- Drawbacks:
  - Translation of data prep is not usually included in predictive modeling software tools; it requires coding
  - Models are verbose
  - Open source scoring options are limited

65 Data Preparation
From Amazon.com: Data Preparation for Data Mining by Dorian Pyle. Paperback, Bk&Cd Rom edition (March 15, 1999), Morgan Kaufmann Publishers.
Excellent resource for the part of data mining that takes the most time. Best book on the market for data preparation.

66 Data Mining Methods
From Amazon.com: Handbook of Statistical Analysis and Data Mining Applications by Robert Nisbet, John Elder, Gary Miner. Hardcover: 900 pages. Publisher: Academic Press (April 23, 2009). Language: English.
New data mining book written for practitioners, with case studies and specifics of how problems were worked in Enterprise Miner, Clementine, STATISTICA, or another tool.

67 Applied Predictive Analytics
Learn the art and science of predictive analytics: techniques that get results.
Predictive analytics is what translates big data into meaningful, usable business information. Written by a leading expert in the field, this guide examines the science of the underlying algorithms as well as the principles and best practices that govern the art of predictive analytics. It clearly explains the theory behind predictive analytics, teaches the methods, principles, and techniques for conducting predictive analytics projects, and offers tips and tricks that are essential for successful predictive modeling. Hands-on examples and case studies are included.
Publication Date: March 31, 2014. Edition: 1.

68 IBM Modeler Recipes
- Go beyond mere insight and build models that you can deploy in the day-to-day running of your business
- Save time and effort while getting more value from your data than ever before
- Loaded with detailed step-by-step examples that show you exactly how it's done by the best in the business
Book Details: Language: English. Paperback: 386 pages [235mm x 191mm]. Release Date: November 2013. Author(s): Keith McCormick, Dean Abbott, Meta S. Brown, Tom Khabaza, Scott Mutchler. Topics and Technologies: All Books, Cookbooks, Enterprise.

70 Data Mining Algorithms
From Amazon.com: Neural Networks for Pattern Recognition by Christopher M. Bishop. Paperback (November 1995), Oxford Univ Press.
Excellent book for neural network algorithms, including some lesser known varieties. Described as "Best of the best" by Warren Sarle (Neural Network FAQ).

71 Data Mining Algorithms
From Amazon.com: Pattern Recognition and Neural Networks by Brian D. Ripley, N. L. Hjort (Contributor). Hardcover (October 1995), Cambridge Univ Pr (Short).
Ripley is a statistician who has embraced data mining. This book is not just about neural networks, but covers all the major data mining algorithms in a very technical and complete manner. Sarle calls this the best advanced book on Neural Networks.
From Amazon.com: The Elements of Statistical Learning by Trevor Hastie, Rob Tibshirani, Jerome Friedman. Hardcover (2001), Springer.
By 3 giants of the data mining community. I have read most of the book and can't think of a significant conclusion I disagree with them on. Very technical, but very complete. Topics covered in this book are not usually covered in others, such as kernel methods, support vector machines, principal curves, and many more. Has become my favorite technical DM book. Book has 200 color figures/charts: the first data mining book I've seen that makes use of color, and this book does it right.
ElemStatLearn/download.html

75 Descriptions of Algorithms
- Neural Network FAQ: ftp://ftp.sas.com/pub/neural/faq.html
- Statistical data mining tutorials by Andrew Moore, Carnegie Mellon
- A list of papers and abstracts from The University of Bonn. Data Clustering and Visualization is a category of particular interest. Hasn't been updated since 2003, but still a good selection of papers.
- A Statistical Learning/Pattern Recognition Glossary by Thomas Minka. Very comprehensive list of data mining terms and glossary-like descriptions.
