Querying a Clustering Model (Analysis Services - Data Mining)

When you create a query against a data mining model, you can retrieve metadata about the model, or create a content query that provides details about the patterns discovered in analysis. Alternatively, you can create a prediction query, which uses the patterns in the model to make predictions for new data. Each type of query will provide different information. For example, a content query might provide additional details about the clusters that were found, whereas a prediction query might tell you to which cluster a new data point is most likely to belong.

This section explains how to create queries for models that are based on the Microsoft Clustering algorithm.

All mining models expose the content learned by the algorithm according to a standardized schema, the mining model schema rowset. You can create queries against the mining model schema rowset by using Data Mining Extension (DMX) statements. In SQL Server 2008, you can also query the schema rowsets directly as system tables.

Sample Query 1: Getting Model Metadata by Using DMX

The following query returns basic metadata about the clustering model, TM_Clustering, that you created in the Basic Data Mining Tutorial. The metadata available in the parent node of a clustering model includes the name of the model, the database where the model is stored, and the number of child nodes in the model. This query uses a DMX content query to retrieve the metadata from the parent node of the model:

Sample Query 2: Retrieving Model Metadata from the Schema Rowset

By querying the data mining schema rowset, you can find the same information that is returned in a DMX content query. However, the schema rowset provides some additional columns. These include the parameters that were used when the model was created, the date and time that the model was last processed, and the owner of the model.

The following example returns the date the model was created, modified, and last processed, together with the clustering parameters that were used to build the model, and the size of the training set. This information can be useful for documenting the model, or for determining which of the clustering options were used to create an existing model.

The most useful content queries on clustering models generally return the same type of information that you can browse by using the Cluster Viewer. This includes cluster profiles, cluster characteristics, and cluster discrimination. This section provides examples of queries that retrieve this information.

Sample Query 3: Returning a Cluster or List of Clusters

Because all clusters have a node type of 5, you can easily retrieve a list of the clusters by querying the model content for only the nodes of that type. You can also filter the nodes that are returned by probability or by support, as shown in this example.

The attributes that define the cluster can be found in two columns in the data mining schema rowset.

The NODE_DESCRIPTION column contains a comma-separated list of attributes. Note that the list of attributes might be abbreviated for display purposes.

The nested table in the NODE_DISTRIBUTION column contains the full list of attributes for the cluster. If your client does not support hierarchical rowsets, you can return the nested table by adding the FLATTENED keyword before the SELECT column list. For more information about the use of the FLATTENED keyword, see SELECT FROM <model>.CONTENT (DMX).

Sample Query 4: Returning Attributes for a Cluster

For every cluster, the Cluster Viewer displays a profile that lists the attributes and their values. The viewer also displays a histogram that shows the distribution of values for the whole population of cases in the model. If you are browsing the model in the viewer, you can easily copy the histogram from the Mining Legend and then paste it to Excel or a Word document. You can also use the Cluster Characteristics pane of the viewer to graphically compare the attributes of different clusters.

However, if you must obtain values for more than one cluster at a time, it is easier to query the model. For example, when you browse the model, you might notice that the top two clusters differ with regard to one attribute, Number Cars Owned. Therefore, you want to extract the values for each cluster.

The first line of the code specifies that you want only the top two clusters.

Note

By default, the clusters are ordered by support. Therefore, the NODE_SUPPORT column can be omitted.

The second line of the code adds a sub-select statement that returns only certain columns from the nested table column. Furthermore, it restricts the rows from the nested table to those related to the target attribute, Number Cars Owned. To simplify the display, the nested table is aliased.

Note

The nested table column, PROBABILITY, must be enclosed in brackets because it is also the name of a reserved MDX keyword.

As a shortcut, rather than writing your own queries by using DMX, you can also call the system stored procedures that Analysis Services uses to work with clusters. The following example illustrates how to use the internal stored procedures to return the profile for a cluster with the ID of 002.

The data mining system stored procedures are for internal use and Microsoft reserves the right to change them as needed. For production use, we recommend that you create queries by using DMX, AMO, or XMLA.

Sample Query 6: Finding Discriminating Factors for a Cluster

The Cluster Discrimination tab of the Cluster Viewer enables you to easily compare a cluster with another cluster, or compare a cluster with all remaining cases (the complement of the cluster).

However, creating queries to return this information can be complex, and you might need some additional processing on the client to store the temporary results and compare the results of two or more queries. As a shortcut, you can use the system stored procedures.

The following query returns a single table that indicates the primary discriminating factors between the two clusters that have the node IDs of 009 and 007. Attributes with positive values favor cluster 009, whereas attributes with negative values favor cluster 007.

This is the same information that is presented in the chart of the Cluster Discrimination viewer if you select Cluster 9 from the first drop-down list and Cluster 7 from the second drop-down list. To compare Cluster 9 with its complement, you use the empty string in the second parameter, as shown in the following example:

The data mining system stored procedures are for internal use and Microsoft reserves the right to change the arguments or behavior as needed. For production use, we recommend that you create queries by using DMX, AMO, or XMLA.

Sample Query 7: Returning Cases that Belong to a Cluster

If drillthrough has been enabled on the mining model, you can create queries that return detailed information about the cases used in the model. Moreover, if drillthrough has been enabled on the mining structure, you can include columns from the underlying structure by using the StructureColumn (DMX) function.

The following example returns two columns that were used in the model, Age and Region, and one more column, First Name, that was not used in the model. The query returns only cases that were classified into Cluster 1.

To return the cases that belong to a cluster, you must know the ID of the cluster. You can obtain the ID of the cluster by browsing the model in one of the viewers. Or, you can rename a cluster for easier reference, after which you could use the name in place of an ID number. However, know that the names that you assign to a cluster will be lost if the model is reprocessed.

Although clustering is typically used for describing and understanding data, the Microsoft implementation also lets you make prediction about cluster membership, and return probabilities associated with the prediction. This section provides examples of how to create prediction queries on clustering models. You can make predictions for multiple cases, by specifying a tabular data source, or you can provide new values on at a time by creating a singleton query. For clarity the examples in this section are all singleton queries.

Sample Query 8: Predicting Outcomes from a Clustering Model

If the clustering model you create contains a predictable attribute, you can use the model to make predictions about outcomes. However, the model handles the predictable attribute differently depending on whether you set the predictable column to Predict or PredictOnly. If you set the usage of the column to Predict, the values for that attribute are added to the clustering model and appear as attributes in the finished model. However, if you set the usage of the column to PredictOnly, the values are not used to create clusters. Instead, after the mode is completed, the clustering algorithm creates new values for the PredictOnly attribute based on the clusters to which each case belongs.

The following query provides a single new case to the model, where the only information about the case is the age and gender. The SELECT statement specifies the predictable attribute/value pair that you are interested in, and the PredictProbability (DMX) function tells you the probability that a case with those attributes will have the targeted outcome.

Example of results when the usage is set to PredictOnly and the model is reprocessed:

Bike Buyer

Expression

1

0.55843544003102

In this example, the difference in the model is not significant. However, sometimes it can be important to detect differences between the actual distribution of values and what the model predicts. The PredictCaseLikelihood (DMX) function is useful in this scenario, because it tells you how likely a case is, given the model.

The number that is returned by the PredictCaseLikelihood function is a probability, and therefore is always between 0 and 1, with a value of 0.5 representing a random outcome. Therefore, a score less than 0.5 means that the predicted case is unlikely, given the model, and a score over 0.5 indicates that the predicted case is more likely than the random case.

For example, the following query returns two values that characterize the likelihood of a new sample case. The non-normalized value represents the probability given the current model. When you use the NORMALIZED keyword, the likelihood score that is returned by the function is adjusted by dividing "probability with the model" by "probability without the model".

Sample Query 9: Determining Cluster Membership

This example uses the Cluster (DMX) function to return the cluster to which the new case is most likely to belong, and uses the ClusterProbability (DMX) function to return the probability for membership in that cluster.

Note By default, the ClusterProbability function returns the probability of the most likely cluster. However, you can specify a different cluster by using the syntax ClusterProbability('cluster name'). If you do this, be aware that the results from each prediction function are independent of the other results. Therefore, the probability score in the second column could refer to a different cluster than the cluster named in the first column.

In the previous example, the probability score was not very high. To determine if there is a better cluster, you can use the PredictHistogram (DMX) function together with the Cluster (DMX) function to return a nested table that includes all possible clusters, together with the probability that the new case that belongs to each cluster. The FLATTENED keyword is used to change the hierarchical rowset into a flat table for easier viewing.

By default, the results are ranked by probability. The results tell you that, even though the probability for Cluster 2 is fairly low, Cluster 2 is still the best fit for the new data point.

Note The additional column, $DISTANCE, represents the distance from the data point to the cluster. By default, the Microsoft Clustering Algorithm uses scalable EM clustering, which assigns multiple clusters to each data point and ranks the possible clusters. However, if you create your clustering model using the K-means algorithm, only one cluster can be assigned to each data point, and this query would return only one row. Understanding these differences is necessary to interpret the results of the PredictCaseLikelihood (DMX) function. For more information about the differences between EM and K-means clustering, see Microsoft Clustering Algorithm Technical Reference.

All Microsoft algorithms support a common set of functions. However, models that are built by using the Microsoft Clustering algorithm support the additional functions that are listed in the following table.