Microsoft Sequence Clustering Algorithm

The Microsoft Sequence Clustering algorithm is a sequence analysis algorithm provided by Microsoft SQL Server 2005 Analysis Services (SSAS). You can use this algorithm to explore data that contains events that can be linked by following paths, or sequences. The algorithm finds the most common sequences by grouping, or clustering, identical sequences together. These sequences can take many forms, including:

Data that describes the click paths that users follow through a Web site.

Data that describes the order in which a customer adds items to a shopping cart at an online retailer.

This algorithm is similar to the Microsoft Clustering Algorithm. However, instead of finding clusters of cases that contain similar attributes, the Microsoft Sequence Clustering algorithm finds clusters of cases that contain similar paths in a sequence.

The mining model that this algorithm creates contains descriptions of the most common sequences in the data. You can use the descriptions to predict the next likely step of a new sequence. When the algorithm clusters records, it can also account for columns in the data that are not directly related to the sequences. Because the algorithm includes the unrelated columns, you can use the resulting model to identify relationships between sequenced data and data that does not occur in a sequence.

The Adventure Works company's Web site collects information about what pages site users visit, and about the order in which the pages are visited. Because the company provides online ordering, customers must log in to the site. This provides the company with click information for each customer profile. By using the Microsoft Sequence Clustering algorithm on this data, the company can find groups, or clusters, of customers who have similar patterns or sequences of clicks. The company can then use these clusters to analyze how users move through the Web site, to identify which pages are most closely related to the sale of a particular product, and to predict which pages are most likely to be visited next.
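The "predict which pages are most likely to be visited next" idea can be illustrated with a minimal sketch that counts page-to-page transitions and picks the most frequent follower. This is not the SSAS implementation; the click paths, page names, and the helper functions `build_transition_counts` and `predict_next` are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict


def build_transition_counts(paths):
    """Count how often each page is immediately followed by each other page."""
    counts = defaultdict(Counter)
    for path in paths:
        for current_page, next_page in zip(path, path[1:]):
            counts[current_page][next_page] += 1
    return counts


def predict_next(counts, current_page):
    """Return (page, probability) for the most likely next page, or None."""
    followers = counts.get(current_page)
    if not followers:
        return None  # the page was never followed by another page
    page, hits = followers.most_common(1)[0]
    return page, hits / sum(followers.values())


# Hypothetical click paths, one list per customer session.
click_paths = [
    ["Home", "Bikes", "Cart", "Checkout"],
    ["Home", "Bikes", "Accessories", "Cart"],
    ["Home", "Accessories", "Cart", "Checkout"],
]

counts = build_transition_counts(click_paths)
prediction = predict_next(counts, "Home")
print(prediction)
```

In this toy data, two of the three sessions go from "Home" to "Bikes", so "Bikes" is the predicted next page. A real model would also condition the prediction on the cluster that the customer most likely belongs to.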

The algorithm uses the Expectation Maximization (EM) clustering method to identify clusters and their sequences. Specifically, the algorithm uses a probabilistic method to determine the probability that a data point exists in a cluster. For a description of how this clustering method is used in the Microsoft Clustering algorithm, see Microsoft Clustering Algorithm.
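To make the probabilistic membership idea concrete, the following sketch computes the probability that one sequence belongs to each of two clusters, which corresponds to the expectation step of EM. It assumes that each cluster is described by a weight, start probabilities, and first-order transition probabilities; that representation, the cluster names, and all numbers are assumptions made for illustration and are not taken from the product.

```python
import math


def sequence_log_likelihood(sequence, start_probs, transition_probs, floor=1e-6):
    """Log-probability of a sequence under one cluster's transition model.

    Unseen starts or transitions get a small floor probability (an assumption)
    so that the logarithm is always defined.
    """
    log_p = math.log(start_probs.get(sequence[0], floor))
    for current_state, next_state in zip(sequence, sequence[1:]):
        log_p += math.log(transition_probs.get((current_state, next_state), floor))
    return log_p


def membership_probabilities(sequence, clusters):
    """Expectation step: probability of each cluster given one sequence."""
    log_scores = {
        name: math.log(c["weight"])
        + sequence_log_likelihood(sequence, c["start"], c["transitions"])
        for name, c in clusters.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_scores.values())
    unnormalized = {name: math.exp(s - max_log) for name, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical clusters with made-up probabilities.
clusters = {
    "buyers": {
        "weight": 0.5,
        "start": {"Home": 1.0},
        "transitions": {("Home", "Bikes"): 0.8, ("Bikes", "Cart"): 0.9},
    },
    "browsers": {
        "weight": 0.5,
        "start": {"Home": 1.0},
        "transitions": {("Home", "Bikes"): 0.3, ("Bikes", "Cart"): 0.2},
    },
}

memberships = membership_probabilities(["Home", "Bikes", "Cart"], clusters)
print(memberships)
```

The sequence is not assigned to a single cluster outright. It receives a probability for every cluster, and a full EM run would alternate this step with re-estimating each cluster's probabilities from those weighted memberships.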

One of the input columns that the Microsoft Sequence Clustering algorithm uses is a nested table that contains sequence data. This data is a series of state transitions of individual cases in a dataset, such as product purchases or Web clicks. To determine which sequence columns to treat as input columns for clustering, the algorithm measures the differences, or distances, between all the possible sequences in the dataset. After the algorithm measures these distances, it can use the sequence column as an input for the EM method of clustering.
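The text does not say which distance measure the algorithm uses. Purely as an illustration of what a "distance between sequences" can mean, the sketch below uses edit distance (the number of insertions, deletions, and substitutions needed to turn one sequence into another). The choice of edit distance and the function name `edit_distance` are assumptions, not a description of the product's method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of states."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


# Hypothetical click paths.
path_a = ["Home", "Bikes", "Cart"]
path_b = ["Home", "Cart"]
print(edit_distance(path_a, path_b))
```

Under this measure, two paths that differ by a single skipped page are close together, while paths that share no pages are far apart.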

A sequence clustering model requires a key that identifies records, and a nested table that contains a sequence-related column, such as a Web page identifier, that identifies the events in a sequence. Only one sequence-related column is allowed for each sequence, and only one type of sequence is allowed in each model. For example, to model the shopping cart scenario mentioned earlier in this topic, you would need a data source that contains two tables. The first table would contain orders, and the second table would contain the sequence in which items were added to the shopping cart for each order.
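The key-plus-nested-table layout can be pictured with two small in-memory tables. The sketch below joins a hypothetical case table to a hypothetical nested table and orders each case's events by a sequence column. All table names, column names, and values (`order_id`, `line_number`, `product`, `region`) are invented for illustration.

```python
from collections import defaultdict

# Hypothetical case table: one row per order, identified by the key order_id.
orders = [
    {"order_id": 1, "region": "Europe"},
    {"order_id": 2, "region": "Pacific"},
]

# Hypothetical nested table: one row per event, with its position in the sequence.
order_items = [
    {"order_id": 1, "line_number": 2, "product": "Helmet"},
    {"order_id": 1, "line_number": 1, "product": "Road Bike"},
    {"order_id": 2, "line_number": 1, "product": "Tire"},
    {"order_id": 2, "line_number": 2, "product": "Tube"},
]


def build_cases(case_rows, nested_rows):
    """Attach to each case its events, ordered by line_number."""
    grouped = defaultdict(list)
    for row in nested_rows:
        grouped[row["order_id"]].append((row["line_number"], row["product"]))
    cases = []
    for case in case_rows:
        events = sorted(grouped.get(case["order_id"], []))
        cases.append({**case, "sequence": [product for _, product in events]})
    return cases


cases = build_cases(orders, order_items)
for case in cases:
    print(case)
```

Each resulting case carries one ordered sequence plus its non-sequence attributes (here, `region`), which mirrors the single sequence-related column per model described above.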

The Microsoft Sequence Clustering algorithm does not support using the Predictive Model Markup Language (PMML) to create mining models.

The Microsoft Sequence Clustering algorithm supports several parameters that affect the performance and accuracy of the resulting mining model. The following table describes each parameter.

| Parameter | Description |
| --- | --- |
| CLUSTER_COUNT | Specifies the approximate number of clusters to be built by the algorithm. If the approximate number of clusters cannot be built from the data, the algorithm builds as many clusters as possible. Setting the CLUSTER_COUNT parameter to 0 causes the algorithm to use heuristics to best determine the number of clusters to build. The default is 10. |
| MINIMUM_SUPPORT | Specifies the minimum number of cases in each cluster. The default is 10. |
| MAXIMUM_SEQUENCE_STATES | Specifies the maximum number of states that a sequence can have. Setting this value to a number greater than 100 may cause the algorithm to create a model that does not provide meaningful information. The default is 64. |
| MAXIMUM_STATES | Specifies the maximum number of states for a non-sequence attribute that the algorithm supports. If the number of states for a non-sequence attribute is greater than the maximum number of states, the algorithm uses the attribute’s most popular states and treats the remaining states as missing. |
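The MAXIMUM_STATES behavior described above (keep the most popular states, treat the rest as missing) can be sketched as follows. This is only an illustration of the stated rule, not the product's code; the function name `limit_states`, the use of `None` to represent a missing value, and the sample data are assumptions.

```python
from collections import Counter


def limit_states(values, maximum_states):
    """Keep the most frequent states and replace all other states with None."""
    frequencies = Counter(value for value in values if value is not None)
    popular = {state for state, _ in frequencies.most_common(maximum_states)}
    return [value if value in popular else None for value in values]


# Hypothetical non-sequence attribute with three distinct states.
regions = ["Europe", "Europe", "Europe", "Pacific", "Pacific", "Africa"]
limited = limit_states(regions, maximum_states=2)
print(limited)
```

With a limit of two states, the least frequent state ("Africa") is treated as missing while the two most popular states are kept unchanged.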