IBM SPSS Modeler Cookbook

If you’ve already had some experience with IBM SPSS Modeler, this cookbook will help you delve deeper and exploit the incredible potential of this data mining workbench. The recipes come from some of the best brains in the business.

Keith McCormick et al., October 2013

Book Details

About This Book

Go beyond mere insight and build models that you can deploy in the day-to-day running of your business

Save time and effort while getting more value from your data than ever before

Loaded with detailed step-by-step examples that show you exactly how it’s done by the best in the business

Who This Book Is For

If you have had some hands-on experience with IBM SPSS Modeler and now want to go deeper and take more control over your data mining process, this is the guide for you. It is ideal for practitioners who want to break into advanced analytics.

Table of Contents

Chapter 1: Data Understanding

Introduction

Using an empty aggregate to evaluate sample size

Evaluating the need to sample from the initial data

Using CHAID stumps when interviewing an SME

Using a single cluster K-means as an alternative to anomaly detection

Using an @NULL multiple Derive to explore missing data

Creating an Outlier report to give to SMEs

Detecting potential model instability early using the Partition node and Feature Selection node

Chapter 2: Data Preparation – Select

Introduction

Using the Feature Selection node creatively to remove or decapitate perfect predictors

Running a Statistics node on anti-join to evaluate the potential missing data

Evaluating the use of sampling for speed

Removing redundant variables using correlation matrices

Selecting variables using the CHAID Modeling node

Selecting variables using the Means node

Selecting variables using single-antecedent Association Rules

Chapter 3: Data Preparation – Clean

Introduction

Binning scale variables to address missing data

Using a full data model/partial data model approach to address missing data

Imputing in-stream mean or median

Imputing missing values randomly from uniform or normal distributions

Using random imputation to match a variable's distribution

Searching for similar records using a Neural Network for inexact matching

Using neuro-fuzzy searching to find similar names

Producing longer Soundex codes

Chapter 4: Data Preparation – Construct

Introduction

Building transformations with multiple Derive nodes

Calculating and comparing conversion rates

Grouping categorical values

Transforming high skew and kurtosis variables with a multiple Derive node

Using classification trees to explore the predictions of a Neural Network

Correcting a confusion matrix for an imbalanced target variable by incorporating priors

Using aggregate to write cluster centers to Excel for conditional formatting

Creating a classification tree financial summary using aggregate and an Excel Export node

Reformatting data for reporting with a Transpose node

Changing formatting of fields in a Table node

Combining generated filters

Chapter 8: CLEM Scripting

Introduction

Building iterative Neural Network forecasts

Quantifying variable importance with Monte Carlo simulation

Implementing champion/challenger model management

Detecting outliers with the jackknife method

Optimizing K-means cluster solutions

Automating time series forecasts

Automating HTML reports and graphs

Rolling your own modeling algorithm – Weibull analysis

What You Will Learn

Use and understand the industry standard CRISP-DM process for data mining.

Assemble data simply, quickly, and correctly using the full power of extraction, transformation, and loading (ETL) tools.

Control the amount of time you spend organizing and formatting your data.

Develop predictive models that stand up to the demands of real-life applications.

Take your modeling to the next level beyond default settings and learn the tips that the experts use.

Learn why the best model is not always the most accurate one.

Master deployment techniques that put your discoveries to work, making the most of your business's most critical resources.

Challenge yourself with scripting for ultimate control and automation - it’s easier than you think!

In Detail

IBM SPSS Modeler is a data mining workbench that enables you to explore data, identify important relationships that you can leverage, and build predictive models quickly, allowing your organization to base its decisions on hard data, not hunches or guesswork.

IBM SPSS Modeler Cookbook takes you beyond the basics and shares the tips, the timesavers, and the workarounds that experts use to increase productivity and extract maximum value from data. The authors of this book are among the very best of these exponents, gurus who, in their brilliant and imaginative use of the tool, have pushed back the boundaries of applied analytics. By reading this book, you are learning from practitioners who have helped define the state of the art.

Follow the industry standard data mining process, gaining new skills at each stage, from loading data to integrating results into everyday business practices. Get a handle on the most efficient ways of extracting data from your own sources, preparing it for exploration and modeling. Master the best methods for building models that will perform well in the workplace.

Go beyond the basics and get the full power of your data mining workbench with this practical guide.

Authors

Keith McCormick

Keith McCormick is the Vice President and General Manager of QueBIT Consulting's Advanced Analytics team. He brings a wealth of consulting/training experience in statistics, predictive modeling and analytics, and data mining. For many years, he has worked in the SPSS community, first as an External Trainer and Consultant for SPSS Inc., then in a similar role with IBM, and now in his role with an award-winning IBM partner. He possesses a BS in Computer Science and Psychology from Worcester Polytechnic Institute.
He has been using Stats software tools since the early 90s, and has been training since 1997. He has been doing data mining and using IBM SPSS Modeler since its arrival in North America in the late 90s. He is an expert in IBM's SPSS software suite including IBM SPSS Statistics, IBM SPSS Modeler (formerly Clementine), AMOS, Text Mining, and Classification Trees. He is active as a moderator and participant in statistics groups online including LinkedIn's Statistics and Analytics Consultants Group. He also blogs and reviews related books at KeithMcCormick.com. He enjoys hiking in out of the way places, finding unusual souvenirs while traveling overseas, exotic foods, and old books.

Dean Abbott

Dean Abbott is the President of Abbott Analytics, Inc. in San Diego, California. He has over two decades of experience in applying advanced data mining, data preparation, and data visualization methods in real-world data intensive problems, including fraud detection, customer acquisition and retention, digital behavior for web applications and mobile, customer lifetime value, survey analysis, donation solicitation and planned giving. He has developed, coded, and evaluated algorithms for use in commercial data mining and pattern recognition products, including polynomial networks, neural networks, radial basis functions, and clustering algorithms for multiple software vendors.
He is a seasoned instructor, having taught a wide range of data mining tutorials and seminars to thousands of attendees at conferences including PAW, KDD, INFORMS, DAMA, AAAI, and IEEE. He is the instructor of well-regarded data mining courses, explaining concepts in language readily understood by a wide range of audiences, including analytics novices, data analysts, statisticians, and business professionals. He also has taught both applied and hands-on data mining courses for major software vendors, including IBM SPSS Modeler, Statsoft STATISTICA, Salford System SPM, SAS Enterprise Miner, IBM PredictiveInsight, Tibco Spotfire Miner, KNIME, RapidMiner, and Megaputer Polyanalyst.

Meta S. Brown

Meta S. Brown helps organizations use practical data analysis to solve everyday business problems. A hands-on analyst who has tackled projects with up to $900 million at stake, she is a recognized expert in cutting-edge business analytics.
She is devoted to educating the business community on effective use of statistics, data mining, and text mining. A sought-after analytics speaker, she has conducted over 4000 hours of seminars, attracting audiences across North America, Europe, and South America. Her articles appear frequently on All Analytics, Smart Data Collective, and other publications. She is also co-author of Big Data, Mining and Analytics: Key Components for Strategic Decisions (forthcoming from CRC Press, Editor: Stephan Kudyba).
She holds a Master of Science in Nuclear Engineering from the Massachusetts Institute of Technology, a Bachelor of Science in Mathematics from Rutgers University, and professional certifications from the American Society for Quality and National Association for Healthcare Quality. She has served on the faculties of Roosevelt University and National-Louis University.

Tom Khabaza

Tom Khabaza is an independent consultant in predictive analytics and data mining, and the Founding Chairman of the Society of Data Miners. He is a data mining veteran of over 20 years and many industries and applications. He has helped to create the IBM SPSS Modeler (Clementine) data mining workbench and the industry standard CRISP-DM methodology, and led the first integrations of data mining and text mining. His recent thought leadership includes the 9 Laws of Data Mining.

Scott R. Mutchler

Scott R. Mutchler is the Vice President of Advanced Analytics Services at QueBIT Consulting LLC. He spent the first 17 years of his career building enterprise solutions as a DBA, software developer, and enterprise architect. When Scott discovered his true passion was for advanced analytics, he moved into advanced analytics leadership roles where he was able to drive millions of dollars in incremental revenues and cost savings through the application of advanced analytics to the most challenging business problems. His strong IT background turned out to be a huge asset in building integrated advanced analytics solutions.
Recently, he was the Predictive Analytics Worldwide Industrial Sector Lead for IBM. In this role, he worked with IBM SPSS clients worldwide. He architected advanced analytic solutions for some of the world's largest retailers and manufacturers.
He received his master's degree in Geology from Virginia Tech. He lives in Colorado and enjoys an outdoor lifestyle, playing guitar, and travelling.
