In the area of data mining, spreadsheets present a great
opportunity. Although you cannot use spreadsheet programs to do actual data
mining, you can use them to gather data, display results, and get
user-reviewed information into and insight out of IBM® SPSS®
Statistics. In this article, learn how to bring a spreadsheet of raw data into
SPSS Statistics and apply two classification algorithms to create customer
segmentation models. Then, use options in SPSS Statistics to create persistent
files that contain the rules for the models. You can use these files to
deploy customer classifications both back to spreadsheets and into a big
data environment.

David Gillman has worked in the areas of business intelligence, data mining, and predictive analytics for 20 years. His educational background is in applied math, optimization, and statistical analysis, with a particular emphasis on application to commercial activities. He has hands-on experience in improving business operations through applied analytics in the distribution, manufacturing, retail, and hospitality industries with organizations of various sizes. You can reach David at gillman@datasooner.com.

Unless your company is a major retailer, you can probably list your
customers in a single spreadsheet. Although a spreadsheet is not the most
advanced or technically sophisticated tool, it lets you easily gather the
data elements about each customer.

A spreadsheet is useful when you create customer segmentation models. You
can use it to collect data from many sources easily, distribute it for
review, and edit it to increase accuracy.

IBM SPSS Statistics makes it easy to use that spreadsheet, which is good
because you will likely do so repeatedly. As you analyze results and talk to
other people, you can add new fields, and then run the modeling process again.

Customer characteristics

You begin by gathering all of the relevant and required information about
your customers into one spreadsheet. The first question typically is,
which characteristics do you use?

I think of customer characteristics as falling into one of
three categories. First, there are the characteristics that most people
usually come up with first. Where is the customer located? What is the
customer's industry? How many employees does it have? What is its revenue?
How many regions is the customer in? These characteristics are the
demographic characteristics of your customers, and your
customer relationship management (CRM) systems often already contain these
data points.

Second, there are characteristics of your customers' behavior.
These behavior characteristics are data points such as the number of
orders in a month, the average value of orders, and the number of days to
pay. Often, you use queries to extract this information from your
enterprise resource planning system. You might already have such
behavioral characteristics of your customers available now. Sometimes, you
create new calculations in queries to get new numbers.

Third, there are characteristics of your customers that do not come from
any centralized database. Examples of this type of information include an
assessment of the relationship quality from your salesperson, or a rating
that is based on the number of returns or complaints. You might have to
add this type of data manually.
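
The behavioral calculations described above can be sketched in a few lines of Python. This is only an illustration of the kind of per-customer summary you might compute before pasting the numbers into the spreadsheet. The record layout (customer_id, order_date, paid_date, amount) and the metric names are assumptions for this example, not fields from any particular enterprise resource planning system.

```python
from collections import defaultdict
from datetime import date


def behavior_summary(orders):
    """Summarize order records into one row of behavior metrics per customer.

    Each order is a dict with hypothetical keys: customer_id, order_date,
    paid_date, and amount.
    """
    by_customer = defaultdict(list)
    for order in orders:
        by_customer[order["customer_id"]].append(order)

    summary = {}
    for customer_id, rows in by_customer.items():
        # Distinct (year, month) pairs in which the customer placed an order.
        months = {(r["order_date"].year, r["order_date"].month) for r in rows}
        days_to_pay = [(r["paid_date"] - r["order_date"]).days for r in rows]
        summary[customer_id] = {
            "orders_per_active_month": len(rows) / len(months),
            "avg_order_value": sum(r["amount"] for r in rows) / len(rows),
            "avg_days_to_pay": sum(days_to_pay) / len(days_to_pay),
        }
    return summary
```

Each entry in the returned dictionary corresponds to one spreadsheet row, with one column per metric.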

SPSS Statistics methods to create segmentation models

SPSS Statistics has several statistical algorithms for creating
segmentation models. It has more than this article can cover in the allotted
space and more than you probably want to read about in one sitting, but
here's the quick list:

TwoStep

K-Means

Hierarchical

Tree

Discriminant

Nearest neighbor

These are among the most widely used clustering and classification
algorithms. You could also add neural networks to that list, but in SPSS
Statistics, that algorithm is listed separately.

Each of these algorithms has strengths and weaknesses, depending on the
amount of data you have, the type or characteristics of the variables, and
your end purpose in classifying the data. I concentrate on two of the
algorithms for this article: K-Means and Tree. (The Tree option builds what
are more broadly called decision trees.)

After your data is in the spreadsheet and brought into the SPSS Statistics
Data Editor, you can choose which algorithm to work with.

Hands on with SPSS Statistics

Figure 1. Spreadsheet data in the SPSS Statistics Data Editor

K-Means

K-Means is a popular clustering algorithm. The key concept to understand is
that the algorithm starts by randomly picking a center point for each
cluster. Then, it assigns each member to the cluster whose center point is
closest to that member. In most cases, closeness is measured as the
Euclidean distance in multidimensional space. The next substep is to find
the center point (usually called the centroid) of each group. Because the
first center points were randomly chosen, the new centers usually end up in
different positions.

After you find the new centroid, the distance from all points is calculated
again and the members are regrouped based on the moved centroid. This
process is repeated until the change in the center positioning stops or
becomes so small as not to matter.
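
The loop described above can be sketched in plain Python. This is a minimal illustration of the general K-Means idea (random starting centers, nearest-center assignment, centroid update, repeat until the centers stop moving). It is not the code that SPSS Statistics runs, and the function name, parameters, and defaults are assumptions for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points at random as the starting centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 3: move each center to the mean (centroid) of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Step 4: stop once no center moves by more than the tolerance.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels
```

Calling kmeans(rows, 3), where rows is a list of numeric tuples with one tuple per customer, returns the final centers and a cluster index (starting at 0) for each customer.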

To use the K-Means clustering option, click Classify > K-Means
Cluster from the Analyze list on the main
menu of the Data Editor. A window similar to Figure 2
appears.

Figure 2. The K-Means algorithm's main page

Move the variables in the left list that you want to use in your analysis
to the Variables list. Likewise, select a column to use
as the unique record identifier and provide it in the Label Cases
by field. For customer classification, that ID is invariably
a customer number.

Be careful at this stage not to move all the variables over without first
checking their usefulness. Sometimes, variables that do not belong can
creep in here. For example, if you have a field that already
has a classifier such as a customer rating given by salespeople, that
information might greatly influence where the clusters end up.
Fortunately, K-Means is less susceptible to such an already-grouped
variable than some of the other algorithms.

Next, adjust the number of clusters you would like to see in the end. Now,
your window looks like Figure 3.

Figure 3. K-Means with configuration options

In the Method box, make sure that the Iterate and
classify option is selected. In the future, you can experiment with
the Iterate and Options buttons. They might change
outcomes but require that you understand the algorithm and the effect that
tweaking might have.

In the Cluster Centers box, select the Write
final check box. Select the Data file
option; then, click File and give the file a name in the
file explorer that appears. Remember where this file resides. When you're
happy with your choices, click OK.

Figure 4. K-Means writing results to a file

Figure 5. K-Means results in the Viewer

Congratulations! You created a clustering classification of your customers.
Now, you can apply the algorithm to new data to see how it looks against a
different set of customers or over time apply it to the customer file as
the data changes.

To do that, bring the new data set of customers from the spreadsheet into
the SPSS Statistics Data Editor. Click Analyze >
Classify, and then select the K-Means
Cluster option. The same window, K-Means
Cluster Analysis, appears. Move the columns in the
spreadsheet over to the Variables list.

Here is where the process is different. Change the options from those that
you used the first time you ran the algorithm to generate the model.
Specifically, in the Method box, select the Classify only
option. Then, in Cluster Centers, select the Read
initial check box. Select the External data
file option, click File, and then use the
file explorer to navigate to the file that the K-Means algorithm wrote in
the earlier process. Your window now looks like Figure
6.

Figure 6. K-Means reading in an existing model

Click Save. In the K-Means Cluster: Save
New window, which is shown in Figure 7,
select the Cluster membership and Distance from
cluster center check boxes. Then, click
Continue.

Figure 7. K-Means save options

These options display the cluster membership for each row (case or
customer) in the spreadsheet that is in the Data Editor window.

Now, click OK to allow SPSS Statistics to use the previously
generated model to classify the new customers. Two new columns appear in
the Data Editor: the cluster membership and the distance measure for each
customer. Click File > Save in the Data Editor to save
this information to a spreadsheet so you can integrate the classification
into your business processes.
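
Conceptually, the classify-only run assigns each new customer to the nearest of the saved cluster centers and records how far away that center is. The following Python sketch shows that idea only. The function name, the 1-based cluster numbering, and the data layout are assumptions for this example, and the sketch does not read the file that SPSS Statistics writes.

```python
import math


def classify_only(centers, customers):
    """Assign each customer to the nearest saved cluster center.

    `centers` is a list of coordinate tuples. `customers` maps a customer ID
    to a tuple of the same variables in the same order.
    Returns {customer_id: (cluster_number, distance)}; clusters are 1-based.
    """
    results = {}
    for customer_id, values in customers.items():
        distances = [math.dist(values, center) for center in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        results[customer_id] = (nearest + 1, distances[nearest])
    return results
```

The two values returned for each customer play the same role as the two new columns in the Data Editor: cluster membership and distance from the cluster center.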

Voilà! You moved from spreadsheet to model and back to spreadsheet.

Tree (Decision Tree)

Decision trees are far from the most sophisticated algorithm available from
the Classify submenu. That said, they are about
the easiest to explain to business people. To use the Decision Tree
algorithm, you read the spreadsheet of all your customers into the SPSS
Statistics Data Editor.

There is one difference in the data from K-Means: In K-Means, I said to
keep information such as salesperson classifications out of the incoming
data. In algorithms like K-Means, such variables can influence and
potentially overwhelm the other variables, proving only that the customers
can be grouped as the salespeople already group them.

In Decision Trees, you need a variable that is the target variable. In
other words, you need a column that already classifies your customers. In
this exercise, I use a sales-based classification because such a
classification probably exists in your company somewhere. The existing
classification might need polishing and cleaning before you use it
formally, but it's likely the best place to get a target variable for
Decision Trees to use.

Let's walk through the Decision Tree menu boxes to see how this works in
SPSS Statistics:

Read your spreadsheet of customer information into the Data
Editor.

Click Analyze > Classify, and then select the
Tree option.

Unlike with K-Means, a Decision Tree window,
which is shown in Figure 8, appears before you
configure the algorithm.

Figure 8. The Decision Tree algorithm variable warning window

Click Define Variable Properties.

The
Define Variable Properties window, which is
shown in Figure 9, appears but with all the
variables in the Variables list. Move the
variables for which you want to adjust the properties to the
Variables to Scan list.

Figure 9. The Decision Tree Variable Definition box

Select those variables that might represent an ordering, such as
A, B, and
C, where A is the best and
C is the worst.

A variable whose values represent a ranking or order, which the
software probably won't detect on its own, is known as an ordinal
variable. Likewise, a nominal variable is one where the values are
categories, but there is no order. Colors are a familiar example:
there is no order to blue, black, and yellow in commercial data.
Use the same drop-down list to make appropriate variables nominal.

Also, be on the lookout for variables that you think might be in
between. For example, clothing size can be considered either
nominal or ordinal depending on your circumstances. When you get
to that point, you are in the minutiae of applied statistics.
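
To make the ordinal and nominal distinction concrete, here is a small Python sketch of how the two kinds of values are commonly turned into numbers outside of SPSS Statistics. It is an illustration only: the function names are hypothetical, and this is not a description of what SPSS Statistics does internally when you set a measurement level.

```python
def encode_ordinal(value, ordered_levels):
    """Map an ordinal value to its rank; the first level gets rank 0."""
    return ordered_levels.index(value)


def encode_nominal(value, categories):
    """Map a nominal value to indicator flags; no order is implied."""
    return [1 if value == category else 0 for category in categories]
```

With the levels A, B, and C listed from best to worst, a rating of B becomes a rank that sits between the other two. A color such as black becomes a set of flags, because placing it between blue and yellow would carry no meaning.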

Click Continue.

Regardless of the variables you chose, the Define Variable
Properties window, which is shown in Figure
10, is where you class them. For this exercise, I classed some of
the variables, such as the SIC code for the type of business the customer
is in, as nominal. Others, like the payment history field, I classed as
ordinal because its categories run in descending order from better-paying
customers to nonpaying customers.

This window contains other options for better defining the properties of
your variables, but they are beyond the scope of this article.

When you are done defining the characteristics of your variables, click
OK to return to the Data Editor. Start the Tree
procedure again from the menu. If the option comes up again to
set the properties of each variable, click OK.

Now, you're at the heart of the Decision Tree window.

There are many resources on the Internet from which you can learn about
Decision Trees, the different statistical algorithms that you can employ,
and how those algorithms' parameters function and influence outcomes. I
walk you through the simple workings of the Tree algorithm so that you can
begin to use it and learn the more complex options later. The windows that
appear when you click Criteria or
Options contain many features that can influence the
processing of the Tree model, such as those features that affect variable
ratings, tree pruning, and misclassification costs.

In the main window, move the variables that you want to use to build the
tree model from the Variables list to the
Independent Variables list, as shown in Figure 11. Also, move a single variable to the
Dependent Variable list. The dependent variable is
the target variable that I discussed earlier.

Figure 11. The Decision Tree algorithm menu window

Next, click Output. When the Decision Tree:
Output window appears, click the Rules tab,
which is shown in Figure 12. In the
Syntax area, I selected the SQL
option, selected the Export rules to a file check box,
and then specified a file to export the rules to. This feature is great
for integrating the classification into business applications like CRM and
reports. You might have to edit the Structured Query Language (SQL) and
paste it into reports or programs, but it is a phenomenal shortcut to
deploying the Tree model.

Figure 12. Determine the output type and location for the Decision Tree algorithm

Click Continue, then click Save. In Figure 13, I specified a file to which I want to
output the tree model. This important feature lets you
integrate the tree model rules into other applications. You can even
use the rules in the XML file to power a big data classification process.

Figure 13. Saving the Decision Tree XML file

After you specify a file in which to store the tree rules, click
Continue.

To recap the last couple of steps, you created two output files, each of
which contains the rules of the Decision Tree. One is in SQL format, and
the other is in XML format.
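
Whatever the file format, the exported rules boil down to a chain of conditions that ends in a segment. The Python sketch below shows that idea with a hand-written rule set. It does not reproduce the contents of the SQL or XML files, and every field name, threshold, and segment label in it is hypothetical.

```python
def classify_customer(customer):
    """Walk a small hand-written rule set, one if/else per tree split.

    All field names, thresholds, and segment labels are hypothetical.
    """
    if customer["payment_history"] in ("A", "B"):
        # Better-paying customers are split again on order value.
        if customer["avg_order_value"] > 500:
            return "Gold"
        return "Silver"
    # Everyone else falls into a single leaf.
    return "Bronze"
```

A business application can apply rules of this shape to one customer record at a time, which is what makes tree models comparatively easy to deploy.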

In the main window, click Validation. The Decision
Tree: Validation window, which is shown in Figure 14, appears. Here is where my previous discussion of
training and validation sets is useful. Select the percentage split you
want to train with; the rest is dedicated to the test set. I also leave
the default option in the Display Results For
area—Training and test
samples—selected.

Figure 14. The Decision Tree: Validation window

These options display in the Data Editor based on how the model classifies
each case or customer. Results of comparing the model performance to the
validation set of data are shown in the SPSS Statistics Viewer.
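
The percentage split can be pictured as a random shuffle followed by a cut. The Python sketch below is a generic illustration of that idea, not the sampling procedure that SPSS Statistics uses. The function name and the fixed seed are assumptions for this example.

```python
import random


def split_training_test(cases, training_percent, seed=0):
    """Randomly split cases into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    # The first part is used to build the tree; the rest is held back.
    cut = round(len(shuffled) * training_percent / 100)
    return shuffled[:cut], shuffled[cut:]
```

The tree is built on the first list only. Comparing its predictions against the second list shows how well the rules hold up on customers that the model has not seen.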

Click Continue to return to the main Decision Tree menu.
Then, click OK to run the modeling process. The rules are
placed in the XML file that you specified in the Save options. Likewise,
the SQL file has the same rules.

Big data and customer segmentation

Now that you have the basics for generating a segmentation model, let's
broaden the topic to how these models and your skills can be deployed in
the context of big data.

I use a general definition of big data—that is, when a flow
of data has too much variety and comes in too fast for manual analysis.
Applying a classification model in that context allows the automated
classifiers to grade or segment customers in real time. As new customers
come in or old customers change their buying patterns, with big data you
can adjust the marketing and sales process in real time.

Imagine a situation where your company has new data feeds in the
future—radio-frequency ID chips for product movement, customer
sentiment analysis that is based on incoming emails, news feeds, and
weather, among other possibilities. With a tool like IBM InfoSphere®
BigInsights™, you can manage those incoming data feeds and store the
data for longer-term use.

Combining the tools inside InfoSphere BigInsights with the XML and SQL
rules from SPSS Statistics, you can classify and reclassify customers as
data flows into InfoSphere BigInsights. Imagine the benefits that you will
gain when the database automatically notifies people when a customer moves
from one segment to another. Your internal business people will be
ecstatic to receive that information in real time.
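
The notification idea rests on a simple comparison: the segment a customer had before versus the segment the model assigns now. The Python sketch below shows that comparison only. The function name and data layout are assumptions for this example, and nothing here is specific to InfoSphere BigInsights.

```python
def segment_changes(previous, current):
    """Return {customer_id: (old_segment, new_segment)} for customers that moved.

    Both arguments map a customer ID to a segment label. Customers that
    appear only in `current` are new and are not reported as changes.
    """
    return {
        customer_id: (previous[customer_id], segment)
        for customer_id, segment in current.items()
        if customer_id in previous and previous[customer_id] != segment
    }
```

Each entry in the result is a candidate for a notification to the salesperson or account manager who owns that customer.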

Most people are just beginning to work with the concepts of big
data. Fortunately, you can use IBM InfoSphere BigInsights Basic Edition
at no charge for now (see Resources). When you
begin to deploy big data into a production environment, you can move up to
InfoSphere BigInsights Enterprise Edition.

Conclusion

SPSS Statistics can do impressive data mining and predictive analytics
work. Segmenting customers is a natural function when data mining. You can
use the basic tools that you have around to analyze and deploy a customer
segmentation model. You can deploy the segmentation information for a wide
variety of uses, including right back into the spreadsheets of your
business users.

Moreover, customer segmentation is an area in which you can work now and
later deploy the same models into a big data environment, which
future-proofs your hard-won analytical work.
