Cluster Analysis using SAS

This tutorial explains how to perform cluster analysis in SAS. It also explains the main statistical techniques of cluster analysis in detail, with examples. Cluster analysis is mainly used for segmentation, and it has become popular in almost every domain as a way to segment customers.

Cluster Analysis

Cluster analysis finds similarities between data objects on the basis of the characteristics found in the data, and groups similar objects into clusters. It is an unsupervised learning technique (there is no dependent variable).

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

Cluster analysis works best with binary or continuous data (numeric variables). If you have categorical variables (ordinal or nominal data), you have to recode them as binary variables that take the value 0 or 1.

Another approach for handling categorical data:

We create (k-1) binary variables for a categorical variable with k levels. For example, suppose a categorical variable contains 3 categories: Retail, Bank and HR. We create two variables to handle the 3 levels, taking one level as the reference level.

The first variable is Retail: all Retail values are coded as 1 and the other two levels as 0.
The second variable is Bank: all Bank values are coded as 1 and the other two levels as 0.

Note: HR is treated as the reference level. It is zero in both variables.

By including Retail and Bank in the analysis, you capture all three levels.
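The coding above can be sketched in a DATA step. This is only an illustration: the data set name readin_dummy and the character variable industry are assumptions, not names from the original example.

```sas
/* Assumed input: data set readin with a character variable INDUSTRY
   that takes the values 'Retail', 'Bank' and 'HR' (hypothetical names). */
data readin_dummy;
   set readin;
   retail = (industry = 'Retail');  /* 1 for Retail, 0 otherwise */
   bank   = (industry = 'Bank');    /* 1 for Bank, 0 otherwise   */
   /* HR is the reference level: retail = 0 and bank = 0 */
run;
```

In SAS a comparison evaluates to 1 when true and 0 when false, so no IF/ELSE logic is needed.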

Check Multicollinearity

Multicollinearity means that independent variables are highly correlated with each other. In cluster analysis there is no dependent variable, so all clustering variables are treated as independent variables.

When variables used in clustering are highly correlated, the characteristic they share is effectively counted more than once, so it gets a higher weight than the others.

A correlation coefficient above 0.7 (in absolute value) between two clustering variables is a reasonable indicator of multicollinearity.
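One simple way to run this check is to print the correlation matrix of the clustering variables with PROC CORR and scan it for values above 0.7. The sketch below assumes the same data set (readin) and variables (V1 through V14) that are used in the standardization code later in this tutorial.

```sas
/* Pairwise Pearson correlations among the clustering variables.
   NOSIMPLE suppresses the descriptive statistics table,
   NOPROB suppresses the p-values. */
proc corr data=readin nosimple noprob;
   var V1-V14;
run;
```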

Standardize Continuous Variables

If one variable has a much wider range than others then this variable will tend to dominate. For example, if body measurements had been taken for a number of different people, the range (in mm) of heights would be much wider than the range in wrist circumference (in cm).

If you do not standardize your data, the variables with large numeric values will dominate the computed dissimilarity, and the variables with small numeric values will contribute very little.

Prior to running cluster analysis, we standardize all the analysis variables (real numeric variables) to a mean of zero and a standard deviation of one (that is, we convert them to z-scores).

Standardize Variables

SAS Code : Standardization

In the code below, input data set is named readin and output data set is named outdata. The analysis variables are V1 through V14.

proc standard data=readin out=outdata mean=0 std=1;
var V1-V14;
run;

Standardization can also be done using PROC STDIZE:

proc stdize data=readin out=outdata method=std;
var V1-V14;
run;

If you want to apply the min-max standardization method, which is (X - Min) / (Max - Min), change the option to method=range in PROC STDIZE.
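For illustration, the min-max variant looks like this. The output data set name outdata_range is an assumption, chosen so that it does not overwrite the z-score output.

```sas
/* METHOD=RANGE subtracts the minimum and divides by the range,
   so each variable is rescaled to the interval [0, 1]. */
proc stdize data=readin out=outdata_range method=range;
   var V1-V14;
run;
```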

Alternative Method to Standardize Continuous Variables

When you suspect that the data contain clusters of non-convex or non-spherical shape, you should estimate the within-cluster covariance matrix and use it to transform the data instead of standardizing.

You can use the ACECLUS procedure to transform the data so that the resulting within-cluster covariance matrix is spherical. It computes canonical variables, which are then used in the subsequent analyses. The canonical variables are linear combinations of the original variables.
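A minimal sketch of this workflow is shown here. The value proportion=0.03, the Ward method, and the data set names ace and tree are assumptions made for illustration, and the input data set is assumed to contain the identifier variable SRL. By default PROC ACECLUS names the canonical variables Can1, Can2 and so on.

```sas
/* Step 1: estimate the within-cluster covariance matrix and
   write the canonical variables Can1-Can14 to the data set ACE. */
proc aceclus data=readin out=ace proportion=0.03 noprint;
   var V1-V14;
run;

/* Step 2: cluster on the canonical variables instead of V1-V14. */
proc cluster data=ace method=ward ccc pseudo outtree=tree;
   var can1-can14;
   id SRL;
run;
```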

Note : The VAR statement specifies that the canonical variables computed in the ACECLUS procedure are used in the cluster analysis. The ID statement specifies that the variable SRL should be added to the Tree output data set.

If the clusters have very different covariance matrices, PROC ACECLUS is not useful. In that case, you can rely on single linkage clustering.

proc cluster data=nonconvex outtree=tree method=single noprint;
run;

Types of Clustering

K-means Clustering (Flat Clustering)

Hierarchical clustering (Agglomerative clustering)

K-means Clustering

The k-means algorithm takes the number of clusters, k, as an input. The value of k is chosen by the user. Each observation is assigned to the cluster whose center is nearest to it, and the distances are calculated with the Euclidean distance formula.

Steps to perform k-means clustering

1. Choose the number of clusters k

2. Choose the initial centers of these clusters, i.e. the centroids or cluster seeds (a centroid is the mean of the points in a cluster). We can take any k random objects as the initial centroids, or the first k objects in sequence.

3. Determine the distance of each object to the centroids using Euclidean distance.

4. Assign each object to the cluster with the minimum distance.

5. Compute new cluster seeds: recompute the centroids (centers) of the clusters by taking the mean of all points in each cluster formed above.

6. Repeat steps 3, 4 and 5 until the centroids no longer change (that is, convergence is reached).

Step 1 : We choose 3 clusters.

Step 2 : The initial cluster centers (means) are (2, 10), (5, 8) and (1, 2), chosen randomly. They are also called cluster seeds.

Step 3 : We need to calculate the distance between each data point and the cluster centers using the Euclidean distance.
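To make the distance calculation concrete, the DATA step below computes the Euclidean distance from one data point to each of the three seeds and picks the nearest seed. The point (2, 5) is a hypothetical value used only for illustration.

```sas
/* Euclidean distance from one hypothetical point to the three seeds */
data dist;
   x = 2; y = 5;                            /* hypothetical data point */
   d1 = sqrt((x - 2)**2 + (y - 10)**2);     /* seed (2, 10): 5          */
   d2 = sqrt((x - 5)**2 + (y - 8)**2);      /* seed (5, 8): about 4.24  */
   d3 = sqrt((x - 1)**2 + (y - 2)**2);      /* seed (1, 2): about 3.16  */
   /* position of the smallest distance among d1, d2, d3 */
   nearest = whichn(min(d1, d2, d3), d1, d2, d3);
run;

proc print data=dist;
run;
```

Here the point is closest to the third seed, so it would be assigned to cluster 3 in this iteration.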

After iteration 2, the cluster assignments differ from those of iteration 1, so we need to run iteration 3.

Step 6 : Check Convergence

The cluster seeds do not change between iteration 2 and iteration 3, so we stop iterating.

Limitations of k-means clustering

The number of clusters must be known before using k-means clustering.

It is sensitive to outliers and noise, because the mean is used to compute the centroids.

When the data set is small, the initial grouping largely determines the final clusters.

Determine the number of clusters in k-means Clustering

Run the k-means clustering code multiple times with different values of k, and look for consensus between the two statistics, that is, local peaks of the CCC and the pseudo-F statistic.

1. Pseudo F

Look for Pseudo F to increase to a maximum as we increment the number of clusters by 1, and then observe when the Pseudo F starts to decrease. At that point we take the number of clusters at the (local) maximum.

2. Cubic Clustering Criterion (CCC)

Look for CCC to increase to a maximum as we increment the number of clusters by 1, and then observe when the CCC starts to decrease. At that point we take the number of clusters at the (local) maximum.

Note :

A peak CCC value greater than 2 or 3 indicates good clustering.

A peak CCC value between 0 and 2 indicates possible clusters, but it should be interpreted cautiously.
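A minimal PROC FASTCLUS call is sketched here. The choice of 4 clusters, the MAXITER value and the output data set names mean and clus_out are assumptions made for illustration; the input is the standardized data set outdata created earlier. The pseudo-F statistic and the CCC appear in the procedure's printed output.

```sas
/* k-means clustering on the standardized variables */
proc fastclus data=outdata maxclusters=4 maxiter=100 converge=0
              mean=mean out=clus_out;
   var V1-V14;
run;
```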

Note : If you want complete convergence (i.e. no relative change in the cluster seeds), set converge=0 and a large value for the MAXITER option.

Explanation of the above code

The MAXCLUSTERS= option tells SAS the (maximum) number of clusters to form using the k-means algorithm.

The MEAN=[SAS-data-set] option creates an output data set that contains the cluster means and other statistics for each cluster.

The OUT=[SAS-data-set] option creates an output data set that contains the original variables and two new variables, cluster and distance. The variable cluster contains the identification number of the cluster to which each observation has been assigned. The variable distance contains the distance from the observation to its cluster seed.

The next thing we want to do is to look at the cluster means. We can characterize the individual clusters.
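One way to look at the cluster means is PROC MEANS with the cluster number as a class variable. The data set name clus_out is an assumption: it stands for whatever data set was named in the OUT= option of PROC FASTCLUS.

```sas
/* Mean of each clustering variable within each cluster */
proc means data=clus_out mean maxdec=2;
   class cluster;
   var V1-V14;
run;
```

Because the variables were standardized, a cluster mean well above or below zero shows that the cluster differs from the overall average on that variable.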

SAS Macro for k-means clustering

This macro lets us run the FASTCLUS code multiple times and look for consensus between the two statistics, that is, local peaks of the CCC and the pseudo-F statistic.

Another metric to check is the R-square value. The R-square value increases as more clusters are specified. The optimal number of clusters would be where R-square reaches a local maximum and then starts falling, or stops increasing by much. However, Milligan and Cooper demonstrated that changes in R-square are not very useful for estimating the number of clusters, although they may be useful if you are interested solely in data reduction.

The macro variable K represents the number of clusters defined in the PROC FASTCLUS procedure.
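A minimal sketch of such a macro is given here. The macro name kmeans, the MAXITER value, the output data set names and the values of K tried are all assumptions made for illustration.

```sas
/* Run PROC FASTCLUS for a given number of clusters K */
%macro kmeans(K);
   proc fastclus data=outdata out=outdata&K. maxclusters=&K. maxiter=300;
      var V1-V14;
   run;
%mend kmeans;

/* Try several values of K and compare the pseudo-F and CCC
   values reported in each run's output */
%kmeans(3)
%kmeans(4)
%kmeans(5)
```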

Suppose the optimal number of clusters comes out as 4. We need to check whether or not the clusters overlap with each other in terms of their location in the 14-dimensional space of the clustering variables. It is not possible to visualize clusters in 14 dimensions. To work around this problem, we can use canonical discriminant analysis, a data reduction technique that creates a smaller number of variables that are linear combinations of the 14 clustering variables. The new variables, called canonical variables, are ordered by the proportion of variance in the clustering variables that each of them accounts for, so the first canonical variable accounts for the largest proportion of the variance.
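This check can be sketched with PROC CANDISC followed by a scatter plot. The data set names clus_out and can_out are assumptions; clus_out stands for the PROC FASTCLUS output data set that contains the variable cluster. By default PROC CANDISC names the canonical variables Can1, Can2 and so on.

```sas
/* Reduce the 14 clustering variables to 2 canonical variables */
proc candisc data=clus_out out=can_out ncan=2;
   class cluster;
   var V1-V14;
run;

/* Plot the first two canonical variables, colored by cluster */
proc sgplot data=can_out;
   scatter x=Can1 y=Can2 / group=cluster;
run;
```

Well-separated groups of points in the plot suggest that the clusters do not overlap much.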

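Hierarchical Clustering

Steps to perform hierarchical (agglomerative) clustering

1. Start by assigning each item to its own cluster, so that N items give N clusters.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.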

Clustering algorithms

Single linkage or nearest neighbor – the similarity between clusters is the shortest distance between any object in one cluster and any object in the other cluster. It is the most commonly used method and it is very flexible. It can define a wide range of clustering patterns, but it can create problems when clusters are poorly delineated.

Complete linkage or farthest neighbor – cluster similarity is based on the maximum distance between observations in each cluster. The similarity between two clusters corresponds to the smallest circle that could encompass both of them. It eliminates some of the problems of single linkage and has been found to generate the most compact clustering solutions.

Centroid method – the similarity between two clusters is the distance between their centroids. This method can produce confusing results.

Ward’s Method – The measures of similarity are the sum of squares within the cluster summed over all variables. The retained clusters are the ones with the smallest values.

Basically, Ward’s Method looks at cluster analysis as an analysis of variance problem, instead of using distance metrics or measures of association.

Determine the number of clusters in Hierarchical Clustering

Look for consensus among the three statistics, that is, local peaks of the CCC and the pseudo-F statistic combined with a small value of the pseudo-T2 statistic followed by a sharply larger pseudo-T2 value at the next cluster fusion. These criteria are appropriate only for compact or slightly elongated clusters, preferably clusters that are roughly multivariate normal.

1. Pseudo T2 statistic

Look for the first relatively large value, then move up one cluster (clustering in step k+1 is selected as the optimal cluster).

2. Pseudo F statistic

Look for Pseudo F to increase to a maximum as we increment the number of clusters by 1, and then observe when the Pseudo F starts to decrease. At that point we take the number of clusters at the (local) maximum.

3. Cubic Clustering Criterion (CCC)

Look for CCC to increase to a maximum as we increment the number of clusters by 1, and then observe when the CCC starts to decrease. At that point we take the number of clusters at the (local) maximum.

A peak CCC value (a peak on the CCC plot) greater than 2 or 3 indicates good clustering.

A peak CCC value between 0 and 2 indicates possible clusters, but it should be interpreted cautiously.

If the CCC increases continually as the number of clusters increases, the distribution may be grainy or the data may have been excessively rounded or recorded with just a few digits.

A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. In step 0, each observation begins in a cluster by itself. In each successive step, the closest (most similar) pair of clusters is found and merged into a single cluster.

How to interpret Dendrogram

In the example above, data points 4 and 5 are more similar to each other than to data point 3. In addition, data points 1 and 2 are more similar to each other than 4 and 5 are to 3.

The OUTTREE= [Dataset Name] option of PROC CLUSTER creates an output data set that contains the results of hierarchical clustering as a tree structure.

The VAR statement lists the numeric variables to be used in the cluster analysis. If you omit the VAR statement, all numeric variables not listed in other statements are used. A name prefix followed by a colon (":") selects all the variables whose names begin with that prefix.

The COPY statement specifies extra variables to be copied to the OUTTREE= data set.
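A minimal PROC CLUSTER call that uses these statements is sketched here. The Ward method, the PRINT= value and the data set name tree are assumptions made for illustration, and the input data set is assumed to contain the identifier variable SRL mentioned earlier. The CCC and PSEUDO options request the statistics discussed above.

```sas
/* Hierarchical clustering on the standardized variables */
proc cluster data=outdata method=ward ccc pseudo print=15 outtree=tree;
   var V:;      /* all variables whose names start with V               */
   copy SRL;    /* extra variable carried along to the OUTTREE= data set */
run;
```

PRINT=15 limits the printed cluster history to the last 15 fusions, which is usually where the number of clusters is decided.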

PROC TREE

PROC TREE creates a data set containing a variable named cluster, which holds the identification number of the cluster to which each observation has been assigned.

The NCLUSTERS= option specifies the number of clusters desired in the OUT= data set.

The COPY statement specifies extra variables to be copied to the OUT= data set.
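A minimal PROC TREE call is sketched here. It assumes that the tree structure from PROC CLUSTER was saved in a data set named tree, that 4 clusters are wanted, and that the output data set is called clus_tree; all three are illustrative choices.

```sas
/* Cut the tree at 4 clusters and keep the clustering variables */
proc tree data=tree out=clus_tree nclusters=4 noprint;
   copy V1-V14;
run;
```

The OUT= data set has one row per original observation, with the assigned cluster number in the variable cluster.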

Interpretation of Results

Semipartial R-squared

Semipartial R-square is a measure of the homogeneity of merged clusters, so Semipartial R-squared is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster. Thus, the SPRSQ value should be small to imply that we are merging two homogeneous groups.

R-square

R-square measures the extent to which groups or clusters are different from each other (so, when you have just one cluster, the RSQ value is, intuitively, zero). Thus, the RSQ value should be high, or as close to 1 as possible, because it is the proportion of variance accounted for by the clusters. It is an important measure of the quality of clusters.

Cubic Clustering Criterion

The Cubic Clustering Criterion (CCC) is a comparative measure of the deviation of the clusters from the distribution expected if data points were obtained from a uniform distribution.

Larger positive values of the CCC indicate a better solution, as it shows a larger difference from a uniform (no clusters) distribution. However, the CCC may be incorrect if clustering variables are highly correlated.

Pseudo-F statistic

The pseudo-F statistic is intended to capture the 'tightness' of clusters, and is in essence a ratio of the mean sum of squares between groups to the mean sum of squares within group.

Larger values of the pseudo-F usually indicate a better clustering solution. If pseudo-F increases with k and reaches a maximum value, the value of k at the maximum, or immediately prior to that point, may be a candidate for the value of k.
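For reference, the ratio described above can be written in terms of R-square. In this sketch the symbols are notation introduced here: c is the number of clusters, n is the number of observations, and R^2 is the proportion of variance accounted for by the clusters.

```latex
\text{pseudo-}F \;=\; \frac{R^{2} / (c - 1)}{(1 - R^{2}) / (n - c)}
```

The numerator is the between-cluster variation per degree of freedom and the denominator is the within-cluster variation per degree of freedom.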

Centroid Distance

Centroid Distance is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged. So, Centroid Distance is a measure of the homogeneity of merged clusters and the value should be small.

Best Approach : Combination of both techniques

First use a hierarchical technique to generate a complete set of cluster solutions and establish the appropriate number of clusters. Then, you use a k-means (nonhierarchical) method.
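One possible way to chain the two methods is to use the cluster means from the hierarchical solution as starting seeds for k-means. This is a sketch under several assumptions: clus_tree is a PROC TREE output data set that contains the variable cluster and the clustering variables V1 through V14, the hierarchical step suggested 4 clusters, and the data set names seeds and final are illustrative.

```sas
/* Step 1: cluster means from the hierarchical solution */
proc means data=clus_tree noprint nway;
   class cluster;
   var V1-V14;
   output out=seeds mean=;
run;

/* Step 2: k-means, starting from those means as initial seeds */
proc fastclus data=outdata seed=seeds maxclusters=4 maxiter=100 out=final;
   var V1-V14;
run;
```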

One should analyze and examine the fundamental characteristics of the defined clusters. Clusters with a small number of observations should be thoroughly examined: do they represent valid components or simply outliers?

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like banking, Telecom, HR and Health Insurance.
