This article explores how you can work with filtered models, a powerful new feature introduced in SQL Server 2008 that allows you to build models on slices of a mining structure.

Explore Filtered Models in SQL Server 2008

Introduction

This
article describes how to use model filtering, a powerful new feature in SQL Server 2008
Data Mining. Filtering allows you to take full advantage of the mining
structure / mining model dichotomy, separating the data staging (in the structure)
from the modeling (in the mining model). This means that you can define and
process a single mining structure for your problem space and then build mining
models on specific slices of interest within that space.

SQL
Server 2008 Data Mining allows specifying, for each mining model, a filter to
be applied on the training data. The filter acts as a partitioning mechanism
inside the mining structure and it is applied on top of any training/testing
partitioning already existing at the structure level. Furthermore,
filters can be applied to data in nested tables. Filters allow you to build
different models for various partitions of the data with minimal effort, either
to obtain better overall accuracy (think of it as manually specifying the first
split in a decision tree), or to compare the patterns between different
partitions.

We will cover a lot of ground in this article, so feel free to skip ahead to the sections you're most interested in.

Creating Filtered Models

So,
let’s start with a simple scenario, using the same old
Adventure Works data -
building a mining model that predicts how likely a customer is to purchase a
bike based on various demographic information. Let’s also assume that you know -
based on your experience with the data - that the Geography factor has an
important role and you don’t want your North American patterns to hide (because
of volume) the peculiarities of the Pacific market.

In
this section, we will show you both the DMX and UI ways of working with
filtered models. Pick your path and follow it!

To create the structure using BI Development Studio, you just need to select “New Mining Structure”. On the “Create the Mining Structure” page of the wizard, select “Create mining structure with no models”. Assuming you already have a data source view for Adventure Works 2008, select it and then select vTargetMail on the “Specify Table Types” page of the wizard. Select the following columns and change the content and data types to match those shown below:
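
If you prefer DMX, a roughly equivalent structure definition might look like the sketch below (the exact column list, content types and holdout settings are assumptions based on the standard vTargetMail view, not the precise wizard output):

CREATE MINING STRUCTURE [TestStructure]
(
    CustomerKey LONG KEY,
    Age LONG CONTINUOUS,
    Region TEXT DISCRETE,
    YearlyIncome DOUBLE DISCRETIZED,
    BikeBuyer LONG DISCRETE
) WITH HOLDOUT (30 PERCENT)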

Now,
the modeling part. As usual, you would create a mining model as follows:
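
A sketch of the DMX (using the structure assumed above; Region is deliberately left out, since the UI steps below set it to ignore):

ALTER MINING STRUCTURE [TestStructure]
ADD MINING MODEL [BikeBuyerClass]
(
    CustomerKey,
    Age,
    YearlyIncome,
    BikeBuyer PREDICT
) USING Microsoft_Decision_Trees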

In the UI, you can create the mining
model by selecting create related mining model. Name the model BikeBuyerClass
and select Decision Trees as the algorithm. Once the model is selected you will
have to set the BikeBuyer column to Predict
and the Region column to ignore as
shown below:

Now
you can add a mining model to the mining structure using only the data for the
Pacific area:
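
In DMX, the filtered model might look like this sketch (same caveats about the exact column list):

ALTER MINING STRUCTURE [TestStructure]
ADD MINING MODEL [DT_Pacific]
(
    CustomerKey,
    Age,
    YearlyIncome,
    BikeBuyer PREDICT
) USING Microsoft_Decision_Trees
WITH FILTER(Region = 'Pacific')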

The
new syntax element is the FILTER construct, which instructs the model to use
in training only those cases where the Region column contains the ‘Pacific’
value.

UI:

To create the filtered model in the UI,
you will need to create the model first and then add the filter. To use fewer
steps, you can create the new model based on BikeBuyerClass model we just
created. Just right click on the BikeBuyerModel name and select “New Mining
Model” as shown below.

This
can also be accomplished by picking the mining model you want to clone and
using the Mining Model drop down from the file menu and selecting “New Mining
Model”.

Now,
to add the filter you have to select “Set Model Filter” from the Mining Model
menu while the mining model is selected or from the right click contextual
menu. This will bring the Model Filter dialog.

The
dialog presents data in tabular form. Each row is a condition. Each condition
can have up to four columns:

·And / Or: This performs a logical And/Or between the different conditions. Valid values are “And” and “Or”. The value can only be set from the second condition on. The current version doesn’t allow condition grouping from the grid.

·Mining Structure Column: This can be any column from the mining structure. Values are prefilled, so you only need to select a column name from a drop-down. The only columns that will not show are key columns, since you can’t filter on keys.

·Operator:
This is a drop down with valid operators based on the data type of the selected
mining structure column (more on this below).

·Value:
This is the value used to compare against the structure column using the operator. Value is not always necessary
(more on this below).

As
you can see above, the same expression used in the DMX version of the filter is
displayed in the Expression portion
of the dialog.

Note:
An expression can be edited manually by selecting “Edit Query” in the dialog.
Be careful: once the expression is edited you will not be able to use the grid
to add elements to the filter. Everything will need to be added by hand in the
Expression text box.

Clone
DT_Pacific and edit the filter to be North America instead of Pacific as shown below.

Note that if you clone the DT_Pacific
model to create the DT_NorthAmerica model using the steps described above, the
filters in the base mining model will also be cloned. This can be a real time-saver when creating multiple models with similar filters
- the time-saving adds up when you start building complex filters.

DMX Deep-Dive on
Filter Syntax

As you may have noticed, FILTER is syntactically a function, taking a Boolean expression.
The Boolean expression may be a simple predicate (of the
Column=Value kind), an EXISTS predicate or a Boolean combination
(using AND, OR and NOT operators) of multiple
predicates.

The
Column part of a simple predicate is always the name of a mining
structure column.

The
corresponding mining structure column does not have to be included in the
mining model. In the examples above, Region is no longer an interesting
attribute when all the training cases seen by a model have the same
value (e.g. Pacific), therefore it is not part of the model.

A simple predicate is defined as <Column> <Operator> <Value>, where the accepted set of operators depends on the content type of the column, and Value must always be a constant. Value generally has to have the same type as the column but can also be the NULL constant.

UI: In the Filter dialog in BI Studio, the NULL constant is denoted by two additional values in the operator drop-downs, IS NULL and IS NOT NULL, which denote <Column> = NULL and <Column> <> NULL respectively.

If the
column is:

A case-level KEY - then no operator can be applied (no key filters are supported). A nested KEY, on the other hand, is just a mapping between case and nested tables and can be used to filter (more on nested table filtering later)

DISCRETE or DISCRETIZED - supports the = and <> operators and also
IS NULL and IS NOT NULL in the Filter dialog

CONTINUOUS - supports the =, <>, <, <=, >, >= operators and also IS NULL and IS NOT NULL in the Filter dialog

XML serialization of floating point double numbers, as well as rounding errors, may result in loss of the less significant digits. Therefore, using an = or <> operator with a double value may produce unexpected results, so it is safer to use a combination of other predicates.

In the case of discretized (numeric or date/time) values, the = and <> operators have semantics similar to those of the PREDICTION JOIN operator: they evaluate the bucket that contains the value and then apply the predicate to the bucket index. Therefore, for a discretized YearlyIncome column with buckets (0, 10000], (10000, 20000] etc., the filters below mean exactly the same thing:

(YearlyIncome=5000)

(YearlyIncome=9000)

(YearlyIncome=999.999)

Applying a filter on a mining model means, effectively, training the
mining model with the data returned by a query like below:

SELECT … FROM MINING STRUCTURE TestStructure.CASES
WHERE IsTrainingCase() AND (<Filter Expression>)

For
those of you who use OLAP mining models, filters are somewhat similar
to defining a cube slice on a mining structure - but at the model level.

If your
filtered model has drillthrough enabled, then all the
structure cases that match the filter will be linked with a content
node, not only the training cases from the structure. Of course, only
the structure training cases matching the filter will be
used in training.

Model Accuracy in Filtered Models

Accuracy
stored procedures can be applied on various data partitions (test or training).
Filters add a new dimension to the data partitions, which can now be filtered
test data or filtered training data. Accuracy stored procedures have syntax
like below:

CALL SystemGetLiftTable(BikeBuyerClass, 2, ‘BikeBuyer’, true)
// - to get the lift chart for the TRUE state of the BikeBuyer attribute

CALL SystemGetClassificationMatrix(BikeBuyerClass, 2,
‘BikeBuyer’) // - to get the full classification matrix for the BikeBuyer
attribute

The
second parameter (2) is a bitmask identifying the data partition(s) to be used
in the accuracy stored procedure. Here are some bit values:
- 1 identifies the training
partition
- 2 identifies the test
partition
- 3 (bitwise OR combination of
1 and 2) identifies ALL the data in the mining structure (training + test)

Filters
introduce another bit in the bitmask, with a value of 4. Therefore, here are
some new values for the second parameter of the accuracy stored procedures:

-
5 (4 bitwise OR 1) identifies the filtered training data seen by the model in
training
- 6 (4 bitwise OR 2) identifies the filtered test data - those test cases
from the mining structure where the model filter applies

-
7 (4 bitwise OR 1 bitwise OR 2) identifies all the cases (training or test) in
the structure where the model filter applies

Therefore
the following calls might return different lift tables:

CALL SystemGetLiftTable(DT_Pacific, 2, ‘BikeBuyer’, true) // -
to get the lift chart for the TRUE state of the BikeBuyer attribute on all the
test cases in the structure

CALL SystemGetLiftTable(DT_Pacific, 6, ‘BikeBuyer’, true) // -
to get the lift chart for the TRUE state of the BikeBuyer attribute on the test
cases in the structure that match the filter defined in DT_Pacific

Filtering Nested Tables

The
most common use of nested tables in SQL Server Data Mining is to model
transaction tables. That is, model Customer entities together with the “bag” of
products they purchased.

Nested
table filters primarily serve two purposes:

Use (in the
model) only those nested table rows with a certain property (e.g. consider
only products that are not “Coca Cola”, because most people do buy “Coca
Cola”, so it is not that interesting)

Use (in the
model) only those cases where the associated nested tables have certain
properties (e.g. - build a prediction model on customers that have
purchased Coca Cola)

Notice
the difference between the filters: while the first filter removes products
from shopping baskets before modeling, the second removes customers from
the model (those that have not purchased Coca Cola).

Let’s
start with a mining structure containing customers, some demographic
information and a list of products they bought:
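
A sketch of such a structure in DMX (the structure name, the nested key and the exact column list are illustrative assumptions):

CREATE MINING STRUCTURE [CustomersWithProducts]
(
    CustomerKey LONG KEY,
    Age LONG CONTINUOUS,
    IncomeGroup TEXT DISCRETE,
    Products TABLE
    (
        Product TEXT KEY,
        LineNumber LONG DISCRETE
    )
)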

Now,
the modeling part: for the first kind of filter described in the beginning, the
goal is to build a model that predicts some, for example, demographics
(IncomeGroup) based on the products a customer purchased, without considering a
certain product (say, the ‘Mountain-500’ bike):
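
In DMX, the nested table row filter might be expressed along these lines (using the hypothetical structure above):

ALTER MINING STRUCTURE [CustomersWithProducts]
ADD MINING MODEL [IncomeGroupFromProducts]
(
    CustomerKey,
    IncomeGroup PREDICT,
    Products
    (
        Product
    ) WITH FILTER(Product <> 'Mountain-500')
) USING Microsoft_Decision_Trees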

The
syntax of the nested table filter is very similar with the syntax of model
filters - it effectively specifies a WHERE clause to be applied on the rows of
the nested table. The training data for the model can be obtained with the
following drillthrough query:
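
Such a query could look roughly like this sketch - it mirrors the nested row filter by applying a WHERE clause to the nested sub-select, and it assumes drillthrough is enabled so that CASES queries are allowed:

SELECT CustomerKey, IncomeGroup,
    (SELECT Product FROM Products WHERE Product <> 'Mountain-500')
FROM MINING STRUCTURE [CustomersWithProducts].CASES
WHERE IsTrainingCase()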

For the UI, create a structure with no models and add a new model, setting the column LineNumber to ignore (as described in the previous section). The main difference you will notice from the previous example is that now, if you select the nested table column, a new element is selectable in the Mining Model / contextual menu: Set Nested Table Row Filter. Select this option to open the Nested Table Row Filter dialog. You will know you opened the correct dialog because it will say Nested Table Row Filter for <Model>. The only items that will be available in the filter are nested table columns. The filter for IncomeGroupFromProducts is shown below.

The
syntax changes for the second filter described before. Now the goal is to build
a model predicting IncomeGroup only for those customers that did
buy a certain product:
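
A sketch of the DMX for this second model (again using the hypothetical structure above):

ALTER MINING STRUCTURE [CustomersWithProducts]
ADD MINING MODEL [IncomeGroupFromMountain500]
(
    CustomerKey,
    IncomeGroup PREDICT,
    Products
    (
        Product
    )
) USING Microsoft_Decision_Trees
WITH FILTER(EXISTS (SELECT * FROM Products WHERE Product = 'Mountain-500'))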

The
new syntax element is the EXISTS predicate, which evaluates to TRUE if the
query specified as argument returns at least one row.

Note
that the list of columns for the EXISTS query is never used, so * is as good as
anything else. The WHERE clause of the EXISTS predicate supports, in the case
of nested table filters, only the simple predicates (or Boolean combinations of such predicates)
described before.

The same filter can be applied using the Model Filter dialog in BI Studio. The main difference is in the available mining structure columns: you can select any of the structure’s columns or the nested table. You need to select the nested table as the Structure Column. Valid operators for nested tables are Contains, which translates into EXISTS, and Not Contains, which is the equivalent of NOT EXISTS.

If you select the nested table, a ‘plus’ sign will appear to the left of that condition; this means you can expand the nested table filter. The nested table
conditions work pretty much as regular conditions with the same <Column> <Operator> <Value>
rules. Everything inside the nested conditions will be used to create an Exists clause (with or without the
Not). You can have multiple of these
clauses and use AND / OR to create Boolean combinations. You can see the filter
for IncomeGroupFromMountain500 below:

The
training data for the model can be obtained with the following drillthrough
query:
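
One shape such a query could take (note the EXISTS condition in the WHERE clause, mirroring the model filter; the structure name is the hypothetical one used above):

SELECT * FROM MINING STRUCTURE [CustomersWithProducts].CASES
WHERE IsTrainingCase()
AND EXISTS (SELECT * FROM Products WHERE Product = 'Mountain-500')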

The query actually works in DMX, which implies that DMX now contains the EXISTS function, which evaluates to TRUE if the sub-query argument returns at least one
row. Note that, when used in a regular DMX query (i.e. not in a filter), the
EXISTS sub-query can be applied to any nested table (including function
results, such as PredictHistogram) and can take any WHERE clause that is valid
in that context, including UDFs (therefore, outside of a filter, the EXISTS
clause is not limited to structure table columns and simple predicates).

Combinations
of filters and other SQL Server 2008 DMX features can be used to have very
specialized data views, allowing complex modeling scenarios.

Multiple Filters on
Nested Tables

Notice
how the structure above contains a LineNumber column in the nested table, a
numeric field indicating the category of the product on the same line. The
model below detects cross-sales rules that lead from products in line 1 (bikes)
to products in any other category:
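
A sketch of such a model in DMX - the model name, the exact aliasing details and the choice of algorithm are illustrative assumptions:

ALTER MINING STRUCTURE [CustomersWithProducts]
ADD MINING MODEL [CrossSellFromLine1]
(
    CustomerKey,
    Products AS [LineNumber1Products]
    (
        Product
    ) WITH FILTER(LineNumber = 1),
    Products AS [OtherLineNumberProducts] PREDICT_ONLY
    (
        Product
    ) WITH FILTER(LineNumber <> 1)
) USING Microsoft_Association_Rules
WITH FILTER(EXISTS (SELECT * FROM Products WHERE LineNumber = 1)
        AND EXISTS (SELECT * FROM Products WHERE LineNumber <> 1))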

LineNumber1Products
is an input nested table, based on data in the Products structure column,
and containing only products in the category denoted by LineNumber 1

OtherLineNumberProducts
is a predictable (PREDICT_ONLY) nested table, also based on data in the
Products structure column, but containing only products in the categories
other than the one denoted by LineNumber 1

The model is trained only on those customers that have both at least one product in the LineNumber 1 category and at least one product in a different category, because any other customers are not relevant for the problem the model is trying to solve.

All the filters
(both the nested table row filters and the model case filter) are applied
on mining structure columns which do not necessarily show in the model
(the case of LineNumber). When they do show in the model under an alias,
the filter is applied on the source structure column name.

UI: The same can be accomplished in BI Studio by creating a nested table row filter for each nested table and then a model filter with both EXISTS clauses.

This tip provides insight into how the Attribute Discrimination pane in the Naive-Bayes viewer uses a system stored procedure to get its data.

Under the Hood: How the Naive-Bayes Attribute Discrimination Viewer Gets Its Data

If you have ever wondered about how
the DM viewers get the data to display on the screen, read on.

In many cases, what is displayed in the data mining viewers is the result of
built-in stored procedures which allow the processing required for the view to
be done on the server without requiring all of the model content to be brought
to the server. In this "Under the hood" tip, we will dive into one of
those stored procedures.

Naive-Bayes Attribute Discrimination View

The attribute discrimination view for the Naive-Bayes algorithm shows the differences in the input attributes across the states of
an output attribute. It generally looks like this:

The viewer doesn’t download all of the correlations in the Naive-Bayes content; rather, it calls the stored procedure GetAttributeDiscrimination like this:
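
A call might look like this - the model name, node unique name and attribute states are placeholders, and the exact namespace prefix of the stored procedure may differ:

CALL System.GetAttributeDiscrimination('MyNaiveBayesModel', '100000001',
    'Yes', 1, 'All other states', 2, 0.0005, true)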

The not-so-obvious parameters are, in order: strModel, strPredictableNode, strValue1, iValType1, strValue2, iValType2, dThreshold, and bNormalize. Let’s go through these parameters:

strModel – The name of the model.

strPredictableNode – This one is a bit difficult, as it takes the Node Unique Name of the target attribute instead of just the string you see in the viewer. The Node Unique Name identifies the attribute in the content rowset generated by the model. You can get the list of predictable attributes and their Node Unique Names by calling another stored procedure, like this: CALL System.GetPredictableAttributes('ModelName'). This stored procedure returns two columns – one for the attribute name and one for the Node Unique Name.

strValue1 – The name of the value you want to compare on the
left hand side. The usage of this parameter depends on the value of the
next parameter.

iValType1 – This parameter indicates how to treat strValue1.
It can have values 0, 1, or 2. If this parameter is a 1, the value in
strValue1 is the actual state of the attribute. However, if this parameter
is a 0 or 2, the value in strValue1 is ignored. If the value is 0, the
left-hand value is considered to be the “missing state”. If the value is
2, the left hand value is considered to be “all other states.” In the
example above, “All other states” is specified only because it looks nice (and
it’s easier to just drop the combo box value into the function call even if it
will just be ignored).

strValue2 – Like strValue1, but for the right hand side.

iValType2 – Like iValType1, but for the right hand side.

dThreshold – A threshold value used to filter results, such
that small correlations don’t come back in the results.
Usually you set it to a really small number like 0.0005 in the example above.

bNormalize - Whether or not the result is normalized.
If this value is true, the results are normalized to a maximum absolute value of
100, giving a possible range of –100 to 100. All this does is take the
largest absolute value in the result and divide that into 100, and then multiply
all the other numbers by that amount. If set to false, the numbers are
whatever they are and you can figure it out yourself – it’s up to you, but the
NB viewer
always sets this to true.

The Results

Calling this routine returns a row for every differentiating attribute/value
pair with a score higher than the specified threshold. The row contains
the differentiating pair along with the score and some other columns and looks
somewhat like this:

The score column is the “important” one and is best explained as if you did something like a C language compare routine, e.g. int Compare(int v1, int v2) { return v1 - v2; }. That is, if the value is positive, it favors value1 and if the value is
negative, it favors value2. The other
columns are the actual counts of the correlations of
the discriminator against the inputs. The best way to understand them is
to look at the Mining Legend as you browse the model and click on rows.
For example, if you clicked on the first row of the result above (in either
picture), the Mining Legend would look like this:

Of course, once you have the result set you can use it wherever you want – in
Reporting Services, Integration Services, or in your custom program.

Sidebar: How Do I See What Queries Are Being Sent by the Viewers?

Run SQL Server Profiler.

Start a New Trace from the File Menu and connect to Analysis Services.

Leave all the defaults on for the Trace Properties dialog that appears.

Go to the data mining viewer you're interested in digging into, and browse a
model.

Profiler will show you the queries that are being sent. Click on a "Query Begin"
event line in the top pane to see the full query text in the bottom pane.

Learn how to use a new SQL Server 2008 feature to get more accurate predictions when you don't have enough historical data for a series.

Predicting future steps when you do not have enough history

No History? No Worries!

Say you’re launching a new product and you want to
predict what sales might look like in the next few months. Classical time series prediction does not
work in this scenario because you don’t have historical sales data for the
product. However, new SQL Server 2008
extensions to the Microsoft_Time_Series algorithm and DMX allow you to easily
apply the patterns from another similarly-behaving time series to solve this
conundrum.

Predicting Post-IPO Stock Values for VISA

In this
example, we illustrate how to use the PredictTimeSeries method with the
parameter REPLACE_MODEL_CASES to obtain predictions for a time series for which
we do not have enough historic data. The power of this combination comes into
play when we have a time series with not enough historic data points to build a
realistic model, but we know that this series follows a pattern similar to
another time series for which we have enough historic data to build a model.

Here’s an Excel 2007 workbook
that has 73 historic data points representing post-IPO daily closing values for
the MasterCard stock and just 8 corresponding values for Visa (since we’re
doing this 8 days after the Visa IPO). Our goal is to use the MasterCard stock
history to derive better predictions for the Visa stock.

8. On the last page of the wizard, rename the
structure “MasterCardStructure” and the model “MasterCardModel”, leave the
default selections to browse the model after it is created and to allow drill
through, and click “Finish” to end the wizard and proceed to build the model.

The MasterCard model historic data and the first 10
predicted values are illustrated in the following graph:

Now, use the same steps to create a time series model for
the Visa stock using the 8 historical data points on the second workbook sheet.
You will see right away that the model will not
generate meaningful predictions due to the lack of sufficient historic data
points. The VisaModel historic data and the next 10 predicted values are
illustrated in the following graph:

Better Predictions Using REPLACE_MODEL_CASES

A better approach is to use the knowledge that the Visa
and MasterCard stocks have a similar pattern and to use the model built for MasterCard
to obtain predictions for the Visa stock values. Here’s how (again using the
Data Mining Client Add-in for Excel):

1. Select the “Query” task from the “Data Mining” ribbon and click “Next” on the introductory page.

2. Select the “MasterCardModel” model and click the “Advanced” button.

3. On the “Data Mining Advanced Query Editor” page, click the “Edit Query” button and select Yes on the dialog asking you to confirm that “Any changes to the query text will not be preserved when you switch back to the design mode.”
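
(Step 4, in which you paste the prediction query into the editor, is not reproduced in this excerpt; a rough sketch of what such a query might look like is shown below. The time key [Day], the value column [Close] and the Visa closing prices are all placeholder assumptions.)

SELECT FLATTENED
    PredictTimeSeries([MasterCardModel].[Close], 10, REPLACE_MODEL_CASES) AS [Visa Forecast]
FROM [MasterCardModel]
NATURAL PREDICTION JOIN
(
    SELECT 1 AS [Day], 56.50 AS [Close]
    UNION SELECT 2, 58.10
    UNION SELECT 3, 59.75
    // ... the remaining Visa closing values go here ...
    UNION SELECT 8, 64.20
) AS t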

5. Click “Finish” and select the results of the query to be copied into a new worksheet.

The results should look like this:

When the REPLACE_MODEL_CASES parameter is used, the
PredictTimeSeries method will return the requested number of predictions
obtained by replacing the last historic points of the given model with the new
values provided in the query. In our case, the last 8 data points for the
MasterCardModel are replaced with the values we generate on the fly using the
SELECT and UNION options in the input set specified after the “NATURAL
PREDICTION JOIN” keywords. Then, the MasterCardModel equations are used to
predict the next 10 values for the Visa stock series.

To see the power of this method, we can compare the
predictions obtained using the MasterCard model (Predictions.Visa), with the
predictions generated by the VisaModel model, obtained using only the limited set of 8 data points of the Visa stock values (Predictions.Visa2). The results are illustrated
in the following graph:

So there you go - you have a new tool in your arsenal when
you don’t have enough data to make accurate time series predictions. Enjoy!

This tip shows you how to take the tedium out of putting together and verifying the correctness of DMX prediction queries for models with multiple nested tables - by generating them programmatically!

Scenario

Imagine you built a model with multiple nested tables and multiple predictable columns inside each nested table. In your application you want to run a prediction query against the model and gather prediction statistics for each predictable attribute. Such queries are usually long and tedious to write, plus they are hard to debug if you made a minor slip. This tip shows you how to programmatically generate such a query using C# and ADOMD.NET.

Notice that the column names in the mining model have spaces (typical if you created the model using BI Development Studio). The column [Product Color New] has been renamed from the original column name. The nested table [Stores With Products] is created from a join of the two source tables. Our goal is to generate a PREDICTION JOIN statement which in the most generic form looks as follows:

SELECT [FLATTENED] [TOP n] select_expression_list

FROM model | sub_select [NATURAL] PREDICTION JOIN

source_data_query [ON join_mapping_list]

[WHERE condition_expression]

[ORDER BY expression [DESC|ASC]]

Building the Prediction Query

First let us consider the list of columns we want to generate in our select_expression_list for the Prediction Join query:

1. For each predictable column in the case table, we want to include the equivalent column from the data source if it exists. This enables us to compare the actual value to the predicted value in our application.

2. For each predictable column in the case table, we want to make available the following statistics:

a. Discrete column: the histogram of states, with the probability for each state.

b. Continuous column: the mean, the STDEV and the probability.

c. Discretized column: the Min, Mid, Max and the probability.

From our example above, the query fragment for #1 and #2 would look as follows:
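
A rough sketch of that part of the select list - the source column t.[Product Color] and the [List Price] and [Model Year] columns are hypothetical, chosen only to illustrate the discrete, continuous and discretized cases:

SELECT
    t.[Product Color] AS [Actual Product Color New],     // actual value from the source query
    [Product Color New],                                 // predicted value of the renamed discrete column
    PredictHistogram([Product Color New]),               // discrete: every state with its probability
    [List Price],                                        // continuous: predicted mean
    PredictStdev([List Price]),                          // continuous: standard deviation
    PredictProbability([List Price]),                    // continuous: probability
    RangeMin([Model Year]),                              // discretized: bucket minimum
    RangeMid([Model Year]),                              // discretized: bucket midpoint
    RangeMax([Model Year]),                              // discretized: bucket maximum
    PredictProbability([Model Year])                     // discretized: bucket probability
    // ... FROM [MyModel] PREDICTION JOIN <source data query> AS t ON ... follows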

3. For each nested table and for each predictable column, we want to include the equivalent column from the data source if it exists. This enables us to compare the actual value to the predicted value in our application.

4. For each predictable column in the nested table and for each unique nested table key, we want to output the same data as #2.

From our example above, the query fragment for #3 and #4 would look as follows:
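
And a rough sketch of the nested-table part of the select list - [Store Name] and [Sales Quantity] are hypothetical nested columns, and the matching source-side columns would be included in the same way as in #1:

    (SELECT
        [Store Name],                                    // the nested key
        PredictHistogram([Sales Quantity])               // per-key statistics for a predictable nested column
     FROM [Stores With Products]) AS [Predicted Stores With Products]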

This tip
will dive into the third way. We will show you how to use a linked server to
execute DMX queries from the SQL Server relational engine; once you can do
that, it’s simple to manipulate the data to your heart’s content using T-SQL
functions and save it to a table using SELECT-INTO.

Linking SQL Server to Analysis Services

Establish
a link to an AS server as follows:

EXEC sp_addlinkedserver
    @server     = 'LINKED_AS',   -- local SQL name given to the linked server
    @srvproduct = '',            -- not used (any value will do)
    @provider   = 'MSOLAP',      -- Analysis Services OLE DB provider
    @datasrc    = 'localhost',   -- Analysis Server name (machine name)
    @catalog    = 'MovieClick'   -- default catalog/database

Alternatively,
you can use SQL Server Management Studio to create the linked server:

Expand the
Server Objects node under the top level node for your server in Object
Explorer.

Right-click on
Linked Servers and select “New Linked Server …”.

Enter a name for
the Linked server (“LINKED_AS”).

The “Other data
source” radio button should be selected by default for Server type.

Select the
latest version of the Microsoft OLE DB Provider for Analysis Services (10.0
for SQL Server 2008) for Provider.

Enter any name
in the Product name field (“DM” will work).

Enter the name
of the Analysis Server instance hosting your mining model(s) in the Data
source field.

Enter the name
of the AS database containing the model(s) you want to query in the
Catalog field.

Click OK.

(Note: You can generate the “EXEC sp_addlinkedserver” statement by right-clicking on the newly-added linked server and selecting “Script Linked Server as …”.)

Running DMX Queries from T-SQL

Now you can
do data mining queries from the SQL Server relational engine with T-SQL and insert the
results into a SQL table using OPENQUERY like this:

SELECT * INTO DMResults FROM
OPENQUERY(LINKED_AS,
    'SELECT Cluster() AS [Cluster], ClusterProbability() AS [Prob]
     FROM [Customers - Clustering]
     NATURAL PREDICTION JOIN
     OPENQUERY([Movie Click], ''SELECT * FROM Customers'') AS t')

Of course,
you can also do all kinds of manipulations on the results that may not have been
possible in straight DMX. For example, finding the average cluster probability
of each cluster becomes simple. Well, almost – since the data type returned by
the Cluster() function is something that GROUP BY does
not support, you have to do some casting first. The actual query would
look like this:
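
A sketch of such a query (the cast length is arbitrary, and you may need to adjust the cast target depending on the type the provider returns for Cluster()):

SELECT CAST([Cluster] AS NVARCHAR(64)) AS [Cluster],
       AVG([Prob]) AS [AvgClusterProbability]
FROM OPENQUERY(LINKED_AS,
    'SELECT Cluster() AS [Cluster], ClusterProbability() AS [Prob]
     FROM [Customers - Clustering]
     NATURAL PREDICTION JOIN
     OPENQUERY([Movie Click], ''SELECT * FROM Customers'') AS t') AS DM
GROUP BY CAST([Cluster] AS NVARCHAR(64))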

That will
give you a nice result showing you, in a way, the affinity of each cluster
based on the input set. That is, if you ran such a query against the
training data, you could say that the clusters with a higher probability are
"tighter" than the ones with low probabilities.

And this is just the starting point. We are sure you will unleash your creativity and
SQL skills to come up with page-long T-SQL queries on top of DMX.

Key Things to Remember

When you’re putting together the DMX to embed in
the OPENQUERY from SQL Server, keep the following in mind:

1. Double
your single quotes.

2. Flatten
nested results. Even if you are not explicitly including a sub-select or
table-returning expression in your query, remember that a simple “SELECT *”
might include a nested table. Example: “SELECT * FROM model.CONTENT”.

3. Alias
function calls like PredictProbability(). If you have multiple function
calls in your DMX statement, they will all have a default column name of
"Expression" and T-SQL will complain about that.

This article highlights a new downloadable sample that allows you to explore all the new Time Series enhancements in SQL Server 2008.

SQL Server 2008 has powerful new Time Series features that you will definitely want to explore and put to use. We have built the SQL Server 2008 Forecasting Sample app to help you get up to speed. The tool features readily-accessible time series data, an easy-to-use interface to model and play with the algorithm features and an explanation window with a link to the server-side DMX queries so you can understand what's going on under the hood.

Time Series Enhancements in SQL Server 2008

The sample highlights the following SQL Server 2008 Time Series enhancements:

Building time series models using either the ARTXP algorithm carried over from SQL Server 2005, the new ARIMA algorithm in SQL Server 2008, or MIXED mode that blends the two algorithms using a smoothing parameter

Prediction with new data using a NATURAL PREDICTION JOIN on a time series model

Adding new training data on the fly during prediction using EXTEND_MODEL_CASES prediction flag

Predicting a new series based on the model for a similar series using the REPLACE_MODEL_CASES flag

This tip shows you how to create server data sources directly from DMX by calling a stored procedure (that you can build and deploy using the accompanying source code).

Creating a data source from DMX

As we know, DMX is not just a query
language - it includes syntax for creating and
processing mining structures and models as well. So
you could choose build your entire data mining
project using DMX, except for one thing: to train
your mining structure/model using INSERT-INTO, you need a
data source object and DMX doesn't give you a way to
create one!

Fortunately, this is easy to fix - just
write a stored procedure that creates data source
objects for you, deploy it to the server and voilà,
you can create as many data sources as you want
directly from your queries without having to write
C# code each time to do so.

Deploying the Stored Procedure

You can create a C# class library
project in VS and replace the contents of the
Class1.cs file with the code below or you can
download the project and just fix up the references.

Follow these steps to build and
deploy the project:

Add a reference to the msmgdsrv.dll (Microsoft.AnalysisServices.AdomdServer.dll) class library from the Browse tab. This class library is available in the location where Analysis Services 2005 is installed. By default, this location has the form: C:\Program Files\Microsoft SQL Server\MSSQL.2\OLAP\bin. Please note that, depending on your installation options, MSSQL.2 might be MSSQL.1 or MSSQL.3 and so on.

Add a reference to "Analysis
Management Objects" from the .NET tab.

Build the project.

Deploy the stored procedure on
the server:

Open SQL
Server Management Studio.

Connect to
the target Analysis Services server instance.

In the Object
Explorer, select the 'Assemblies' node at the server level.

Right-click
and select 'New Assembly'.

For the FileName field, use the Browse (...) button to navigate to the location where the stored procedure was built. A file named CreateDataSourceSP.dll should be available at that location, under the bin\debug folder. Select that file.

Assembly Name
should be automatically filled with CreateDataSourceSP.

Set Permissions to
Unrestricted.

Set Impersonation to "Use the
credentials of the current user".

Click OK to
deploy the stored procedure.

Using the Stored Proc from DMX

You can now call the stored procedure to create a
new data source using the following syntax in DMX:
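
The procedure name and arguments below are purely illustrative (check the accompanying source code for the actual signature); the general CALL pattern looks like this:

CALL CreateDataSourceSP.CreateDataSource(
    'MyRelationalSource',                                                        // name of the new data source
    'Provider=SQLNCLI10;Data Source=localhost;Initial Catalog=AdventureWorksDW', // connection string
    'ImpersonateCurrentUser')                                                    // impersonation mode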

This tip shows you how to get all the pieces you need to try out the new SQL Server 2008 DM Add-Ins for Office 2007.

The final release for SQL Server 2008 Data Mining Add-ins for Office 2007 is available for download. Here are the steps to setup the add-ins on your machine and play with the new features:

Download the evaluation version of SQL Server 2008 and install Analysis Services (client and server components). This should install .NET Framework 2.0 and ADOMD.NET, both prerequisites for the add-ins.

If you do not have Microsoft Visio 2007 installed on your machine, download and install the trial version of Microsoft Visio Professional 2007 (Windows Live sign-in required). This is required only if you want to use the SQL Server 2008 Data Mining Add-in for Visio.

Download and install the add-ins. Only the Table Analysis Tools add-in is installed by default so make sure you select all components during setup if you would like to test-drive all three add-ins (there is no reason not to).

Follow the Getting Started wizard (which launches the first time you run Excel 2007 or Visio 2007 after installing the SQL Server 2008 DM Add-ins) to configure your Analysis Services instance correctly for using the add-ins. If you miss it for some reason when starting Excel 2007 or Visio 2007, the Getting Started wizard is available from the Help button under the Data Mining tab (ribbon) in Excel 2007 as well as from Start menu -> Programs -> Microsoft SQL Server 2008 DM Add-ins.

Open up the sample Excel workbook provided with the add-ins (Start menu -> Programs -> Microsoft SQL Server 2008 DM Add-ins -> Sample Excel Data), click within the table on the Table Analysis Tools Sample sheet and select the Analyze tab under Table Tools to see the Table Analysis Tools ribbon. Or alternatively, you can create your own table in Excel by selecting a range and clicking the Format as Table button from the Home tab (ribbon).

You can get more information about the new features in this version of the add-ins on this page. You can learn more about all the cool features carried over from the SQL Server 2005 version here.

This tip shows you how you can apply the cross-validation feature in the upcoming SQL Server 2008 release to estimate if the training set size is sufficient for a given model.

How much training data is enough?

This is a question that frequently
comes up: when do you know that feeding more data to the mining
model will stop improving accuracy? We think the accuracy
procedures in
SQL Server 2008 might be able to help you get some
reasonable estimates.

Background

Algorithms learn from data and,
generally, feeding them with more data will improve the
accuracy. This improvement, however, is asymptotic. If the
training data is statistically representative of the problem
space, then a classifier will stop learning after a certain
limit: new data points will support existing patterns and will
help reduce the risk of over-training, but will not add more
information to the model. The solution we suggest for SQL Server
2008 allows you to figure out if your mining model is likely to
gain from adding new data or is likely to maintain the same
accuracy.

The solution is based on the
new cross-validation feature in SQL Server 2008, so we
will start with a small presentation of this mechanism.
Cross-validation takes a data set, splits it into a number of
(roughly equal size) partitions, called folds, then builds a set
of mining models (as many as there are folds). Each mining model
is built on top of the whole data set minus one
of the folds and then the accuracy of the model is measured on
the fold that was left out. Effectively, all data is used in
training and all data is left out (by at least one mining
model). Also, each model is validated against data that was
not seen during training. The accuracy measures
for each model are then evaluated across all the folds. A good
mining solution would provide compact accuracy measurements
(i.e. no big differences between the measurements on different
folds). Spikes in the measurements may indicate problems with
the data. Note that cross validation does not improve the
accuracy of the model, it just provides an “honest” estimation
of that, reducing the risk of over-training.

In SQL Server 2008, cross
validation allows the user to specify how many cases should be
used for the procedure. This is mainly for performance purposes
— it allows a quick comparison between different mining
models/algorithms on the same data set.

Using Cross-Validation To Determine
Data Sufficiency

Now back to the original
problem: would adding new data benefit accuracy?

The cross-validation stored
procedure supports multiple syntax flavors. The one that is
interesting for our purpose is described below:

CALL SystemGetCrossValidationResults(<Mining Structure>, <Mining Model>, <Number of Folds>, <Number of Cases>, '<Target Attribute>')

The first two parameters are
straightforward: the mining structure containing the data and the mining
model to be cross validated. (Note that the stored procedure
does not require either the mining structure or the model to be
processed; if the mining structure is not processed, the stored
procedure will process it first, but not its contained models).
As we mentioned, the procedure creates a number of models and
the source model is used as a template.

The Number of Folds
parameter specifies how many folds should be used in the
procedure while the Number Of Cases parameter specifies
how many cases should be used by the procedure. The Target
Attribute parameter is used for:

ensuring stratified sampling
for the folds (the distribution of the Target Attribute’s states
is almost the same in all the folds)

directing the accuracy
measurements to be performed on the mining models built by the
procedure.

A typical call would set NumberOfFolds to 10 and NumberOfCases to, say, 1000. Such an invocation builds a 1000-case sample (using random selection) with ten 100-case folds over the sample. Ten models are built, each trained with 900 cases and evaluated against the remaining 100.
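
For example (using the structure, model and target attribute names from the walkthrough below):

CALL SystemGetCrossValidationResults(MyStruct, ClassifyBikeBuyer, 10, 1000, 'Bike Buyer')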

To see if new data would
improve the accuracy, we could build models starting with a
small training set, then incrementally increase (for example,
double) the training set size until the accuracy flattens.

The Algorithm

Here are the actual steps of
the algorithm:

Step 1: Start with an
existing mining structure, using a reasonably large data set

We started with a mining
structure, MyStruct, built on top of the [v Target Mail] view of
the
Adventure Works BI sample database.

Step 2: Define a mining
model in the structure.

We created a simple model,
ClassifyBikeBuyer, predicting Bike Buyer and also English
Occupation. We used Microsoft Decision Trees with all the wizard’s
default suggestions, with the exception that we changed Bike Buyer to
"Discrete
Boolean".

Step 3: Start with a
rather small training set size (say, 10 — it is highly unlikely
that 10 is a good training set size, but it makes the charts
look good :-) )

Step 4: Execute the cross validation procedure using 2 as the number of folds and 2*TrainingSetSize as the number of cases. This step is the interesting part of the algorithm: we have to use at least 2 folds, or the procedure would fail. On the other hand, we do not want to use more than 2 folds, so that each of the models built by the procedure is trained with exactly TrainingSetSize cases and the accuracy measurements are not biased.
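
For the initial training set size of 10, for instance, the call would be along these lines:

CALL SystemGetCrossValidationResults(MyStruct, ClassifyBikeBuyer, 2, 20, 'Bike Buyer')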

Step 5: Record the
accuracy results for the current training size. The stored
procedure result contains multiple accuracy measurements for
each model built by the procedure. As the number of folds
argument was 2, there are two sets of results:

Now, there are a few things to
be recorded in the result set:

- First, the values of the
accuracy measurements for each fold. We used a single measure,
the Log Likelihood Score, a value that, by itself, does not
depend directly on the training or test set size (Pass/Fail
classification scores do depend on these)

- Then, the partition size for
each fold. This is very important, because it is the termination
condition for the algorithm (we do not know ahead of time how
much data is in our mining structure). The folds are almost the
same size; the stratified sampling mechanism may lead to minor
differences. If at least one fold has a size exceeding or equal
to our training set size, the procedure had enough data (see more
details on Partition Size in Step 6).

To simplify the results for
this execution (training set size = 10 cases), we stored the μ
(average) and σ (standard deviation) for the Likelihood Log
Score measurement.

Step 6: Double up the
training set size and repeat from step 4 until not enough data
is found.

The Partition Size in the
result is not the size of the training data for
the model indicated by Partition Index. Every row marked with
Partition Index = 1 is about the model trained leaving partition
1 out (and, therefore, evaluated on partition 1).

The procedure above will stop
when step 4 asks for a data set size which exceeds the data
available in the mining structure.

The Output

Let's examine the results of
this process.

Our sample mining structure has
around 13k cases (the view contains 18k and we preserved 30% as
holdout). Therefore, the procedure worked correctly for data
sets of: 20, 40, 80, 160, 320, 640, 1280, 2560, 5120 and 10240
cases (so it stopped with a training size of 5120 cases —
remember, data size is split in 2 folds!). When asking for 20480
cases, the procedure only found enough data to create 2
partitions of 6470 cases. To evaluate the accuracy of (almost)
the whole training set in the structure, we added one more call:
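
The extra call looked roughly like this - assuming the convention that a Number of Cases value of 0 means “use all the available cases”:

CALL SystemGetCrossValidationResults(MyStruct, ClassifyBikeBuyer, 10, 0, 'Bike Buyer')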

Effectively, we asked the
procedure to perform 10-fold cross validation, so we can
evaluate the accuracy of a model using 90% of the training data.
Not exactly the whole data set, but close enough!

We repeated the same steps for
the ‘English Occupation’ target and plotted the results in
Excel. The chart below presents the accuracy evolution
(Likelihood Log Score) with the size of the training set, and
the error associated with each data point is the standard
deviation for the likelihood log score among the folds of that
size.

So, what does the chart tell
you?

Likelihood Log Score is a
measure that cannot exceed 0. The closer it is to 0, the better
is the accuracy of the model. The chart above suggests that:

However, the data does not
seem enough to predict the English Occupation. As you can
see, the slope is still pretty steep on the last segment of
the English Occupation line.

And Now, The Code

The algorithm seems interesting
enough to apply it later, so we encapsulated it in a small
C# Analysis Services stored procedure.

All the code is available in
this file:
DataSufficiency_AccuracyEx.cs. To use it, you will need to download the
file, create a new C# class library project, add a reference to
Microsoft.AnalysisServices.AdomdServer library and include the
file in the project.

Once you build and deploy the
stored procedure (Safe permission set, impersonate current
user), you can invoke it with a call like this:
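
The entry point and parameter names below are hypothetical - check DataSufficiency_AccuracyEx.cs for the actual signature - but the invocation follows the usual Analysis Services stored procedure pattern:

CALL DataSufficiency.EstimateDataSufficiency('MyStruct', 'ClassifyBikeBuyer', 'Bike Buyer')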

This tip provides insight into the inner workings of the Microsoft Naive Bayes algorithm, showing how the algorithm computes the score used to filter out correlations.

[Note: This article is the first in a new "Under the Hood" series that explores the inner workings of the SQL Server Data Mining algorithms and infrastructure. Enjoy the deep dive!]

Minimum Dependency Probability?

When you create a Microsoft_Naive_Bayes model in SQL Server 2005 Data Mining, you can set an algorithm parameter MINIMUM_DEPENDENCY_PROBABILITY that specifies the minimum dependency probability between the input and output attributes. The algorithm then filters out input attributes that have a dependency probability lower than the set threshold. In this article, we show how exactly to compute the dependency probability between an input and an output attribute. The main idea behind this comes from the research paper titled “A Bayesian Approach to Learning Bayesian Networks with Local Structure”.

A Naïve Bayes model can be treated as a special case of a Bayesian network which has conditional probabilities at every node. For a Naïve Bayes model, the dependency is just between input attributes and the output attribute only.

Computation Walkthrough

Let us consider a simple dataset with one input attribute “Number of Children” and one output attribute “Bike Buyer”.

Bike Buyer    Number of Children    Count
No            0                     1
No            1                     9
No            2                     35
Yes           0                     40
Yes           1                     10
Yes           2                     5

Let

X = Score of the Bayesian Network without the split on “Number of Children”

Y = Score of the Bayesian Network with split on “Number of Children”

P = dependency probability of the node

By definition the dependency probability is proportional to the score which gives us:

P/ (1-P) = Y / X

Taking natural LOG on both sides:

LOG (P / (1 – P)) = LOG Y – LOG X

For computing the score for a split, we use a function LG which is the Log Gamma function. See the end of this article for an explanation on how to compute the log gamma function.

We first evaluate X using the following table:

Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’)

Number of cases = 100

Bike Buyer = missing:   LG(1 + 0) – LG(1) = 0
Bike Buyer = 0:         LG(1 + 45) – LG(1) = 129.124
Bike Buyer = 1:         LG(1 + 55) – LG(1) = 168.327
Bike Buyer:             LG(3) – LG(3 + 100) = -375.82

LOG X = -74.835

We then evaluate the value of Y by splitting the data set using the three different values of Number of Children and computing the partition score in all the three cases.

Number of Children = 0

Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’)

Number of cases = 41

Bike Buyer = missing:   LG(1 + 0) – LG(1) = 0
Bike Buyer = 0:         LG(1 + 1) – LG(1) = -4.4e-11
Bike Buyer = 1:         LG(1 + 40) – LG(1) = 110.32
Bike Buyer:             LG(3) – LG(3 + 41) = -123.526

Partition Score = -10.519

Number of Children = 1

Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’)

Number of cases = 19

Bike Buyer = missing:   LG(1 + 0) – LG(1) = 0
Bike Buyer = 0:         LG(1 + 9) – LG(1) = 12.801
Bike Buyer = 1:         LG(1 + 10) – LG(1) = 15.104
Bike Buyer:             LG(3) – LG(3 + 19) = -46.679

Partition Score = -16.781

Number of Children = 2

Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’)

Number of cases = 40

Bike Buyer = missing:   LG(1 + 0) – LG(1) = 0
Bike Buyer = 0:         LG(1 + 35) – LG(1) = 92.136
Bike Buyer = 1:         LG(1 + 5) – LG(1) = 4.787
Bike Buyer:             LG(3) – LG(3 + 41) = -119.741

Partition Score = 20.155

LOG Y = -47.455

Node Score = LOG Y – LOG X = 27.379

Node Probability = 0.99999

The above example shows a very high probability for the attribute “Number of Children” predicting the value of “Bike Buyer”, which is apparent from the data set. Let us consider another example where “Number of Children” is a very poor predictor for “Bike Buyer”:

Bike Buyer    Number of Children    Count
No            0                     10
No            1                     15
No            2                     25
Yes           0                     10
Yes           1                     15
Yes           2                     25

Using the same score computation method, the values obtained are as follows:

Node Score = LOG Y – LOG X = -6.965

Node Probability = 0.000943

We see that the attribute “Number of Children” has a very low probability of predicting the Bike Buyer state.

How Do I Verify This?

Simple - use the accompanying download for this tip. It includes an Analysis Services project that uses an Access .mdb file with two data tables, “BikeBuyer1” for the first case and “BikerBuyer2” for the second, and creates two mining structures with a Naïve Bayes model for each case. It sets the algorithm parameter MINIMUM_DEPENDENCY_PROBABILITY to 1e-7 in the second case so that the node “Number of Children” shows up and also sets each attribute type to “Discrete”. Once the models are processed, you can use the Mining Content viewer under the “Model Viewer” tab in Business Intelligence Development Studio to look up the node score. The Mining Content viewer will have a column MSOLAP_NODE_SCORE which shows the value of the node score for Number of Children where the Node Type is 10.

Sidebar: Computing Log Gamma

There are several numerical methods available to compute the gamma function and hence the natural logarithm of the value which we term as the log gamma function. The series is represented as follows:
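
The original figure with the series is not reproduced here; for reference, one commonly used asymptotic (Stirling) expansion of the log gamma function LG is:

LG(z) ≈ (z – 1/2) * LN(z) – z + (1/2) * LN(2π) + 1/(12z) – 1/(360z^3) + 1/(1260z^5) – ...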

This tip shows you how to apply new DMX syntax to build a False-Positive/False-Negative report using Reporting Services.

In a previous tip, we explored new syntax introduced in SQL Server 2005 SP2 that allows you to pass column references in place of parameters in DMX queries. One of the nice scenarios this enables is generation of accuracy reports using Reporting Services. We will show you how to build one of these in this tip.

The Scenario

Let’s imagine that we are working for a bank that wants to detect possibly fraudulent transactions and investigate them. It might cost the bank just a few dollars to investigate a transaction that is flagged as fraudulent. On the other hand, the bank may lose several thousand dollars on every transaction that was in fact fraudulent but was not reported as such. Our goal is to minimize the bank’s expenses by detecting the probability threshold at which transactions should be reported as fraudulent.

The Cost of False Positives and False Negatives

A prediction is considered a true-positive (TP) if the probability of the prediction is above a certain threshold and the predicted state is the correct state. In DMX terms, this can be expressed as follows:

Let’s generate a report that shows the number of true-positive, true-negative, false-positive and false-negative predictions and the bank’s expenses associated with all of the transactions. The report allows the user to model various thresholds and pick one that minimizes the associated expenses.

The accompanying download for this tip includes the complete BI Development Studio solution for the report as well as a backup file of the Analysis Services database containing the model that’s used by the report.

The following steps walk you through the process of creating the same report from scratch.

Step 1: Restore the database backup

Go to SQL Server Management Studio, connect to your Analysis Services instance and restore the database from the BankDB.abf file that is part of the download.

3.To define a new data source, select “Microsoft SQL Server Analysis Services” from the list of data source types, and click the “Edit…” button. Type in the server name and select the database name. Click OK and Next.

4.Click the “Query Builder…” button on the “Design the Query” wizard page.

a.When the Query Builder is opened, click the “Query Parameters” button in the toolbar and define the following query parameters:

·The State parameter will be used to specify the state that is being predicted. Fraudulent transactions have a value of 1 in the Fraudulent column, which is why the parameter value is set to 1.

Let's say you have an
application that builds mining models on the fly for your business users. How do
you quickly generate a visualization that they can easily share as an interactive
web page? No worries - the Visio Add-in from the
SQL Server 2005 DM Add-ins
for Office 2007 package has a programmability component that will let you
accomplish this with a dozen or so lines of C# code.

The
component connects to an Analysis Services instance, starts the Visio
application in silent mode, uses the Visio engine in conjunction with the Add-in
to render the data mining diagram in Visio and finally takes advantage of the Save as Web
feature in Visio to convert the drawing to an interactive HTML web page.

The
Visio Add-in supports the following visualizations:

Decision Tree

Regression Tree

Dependency Network

Association Rules

Clustering (with characteristics and
discrimination information)

Setting up the C# Project

Since the required
add-in
assemblies are in the GAC, you have to copy them to a local folder before you
can add the programmability component as a project reference. The steps below show how to locate and copy the files from
the GAC:

Open a command prompt and navigate to
the folder %windir%\assembly\GAC_MSIL.

This tip explores a DMX extension introduced in SQL Server 2005 SP2 that can be used to render accuracy reports directly in Reporting Services.

One of the nice improvements we made to SQL Server 2005 DMX was the ability to predict the probability of a particular state. That is, in SQL Server 2000, you could get the probability of the predicted state using the following query:
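
Such a query would look roughly like this (the model name and the source data query are placeholders):

SELECT Predict(Gender), PredictProbability(Gender)
FROM [MyModel]
NATURAL PREDICTION JOIN
OPENQUERY([MyDataSource], 'SELECT * FROM Customers') AS t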

This query gives you the probability of whatever state is predicted for Gender for each row of the input. However if you wanted to get the probability that a customer was ‘Female’, for example, you would have to write a query like this:
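
For example, with a parameter supplying the state (same placeholder model and source as above):

SELECT PredictProbability(Gender, @DesiredGender)
FROM [MyModel]
NATURAL PREDICTION JOIN
OPENQUERY([MyDataSource], 'SELECT * FROM Customers') AS t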

And specify the desired state at run-time. Wonderful! However, there was a little piece missing. It turns out we even allowed arbitrary expressions like ‘Fe’+’male’ there, but we left out the ability to determine the probability of a known state from the input. That means that if you wanted to determine what the probability was of a state and the state happened to be stored in the database, you would have to first issue a query to get the state and then issue another query to get the probability of that state. Not fun.

In SQL Server 2005 SP2 Data Mining we generalized the language so that, essentially, everywhere you can place a parameter, you can also place a column reference. That means that you can solve the above problem with a query like this:
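
For example (same placeholder names as above), passing the input column itself as the state:

SELECT PredictProbability(Gender, t.Gender)
FROM [MyModel]
NATURAL PREDICTION JOIN
OPENQUERY([MyDataSource], 'SELECT * FROM Customers') AS t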

Now, where is this useful? We’re sure all of the creative DMXers out there will find innumerable uses for this new flexibility, but the most obvious application is for accuracy and lift calculations. Now, in one query, you can generate lift figures for your entire testing set without caching the data, allowing, for example, the generation of accuracy reports in Reporting Services! In the next tip, we will show you an example of this.

This tip shows you how to compute the R-Squared and Adjusted R-Squared metrics for Microsoft regression models using a stored procedure.

Calculating R-Squared for Microsoft regression models

R-Squared
(a.k.a. Coefficient of Determination) is a well-known metric that measures
the goodness of fit of your regression model. Its value lies between 0 and 1. The
closer R-Squared is to 1, the better your model is with respect to the training
data.

R-Squared
can be calculated as:

R-Squared = 1 – RSS / TSS

where RSS is the
residual sum of squares, and TSS is the total sum of squares.
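Spelled out over the training cases (y = actual value, ŷ = predicted value, ȳ = mean of the actual values), these are the usual sums of squares:

RSS = Σ (y – ŷ)²

TSS = Σ (y – ȳ)²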

Another
interesting measure is Adjusted
R-Squared, which adjusts the R-Squared according to the number of
explanatory terms in a model.
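The standard definition - presumably what the stored procedure computes - is, for n training cases and p explanatory terms:

Adjusted R-Squared = 1 – (1 – R-Squared) * (n – 1) / (n – p – 1)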

This
article shows you how to compute both these measures for Microsoft mining
models using our favorite mechanism – stored procedures!

A Windows application that
demonstrates how to call the stored procedure. To build this project, you
need to add a reference to Microsoft.AnalysisServices.AdomdClient.dll
located in “%ProgramFiles%\Microsoft.NET\ADOMD.NET\90”.

Let's walk you through the process of using the pieces in the
package to explore the R-Squared sample.

Step 1:

First,
build and deploy the stored procedure (note that this version only
supports models that allow drill through):

Open the RSquareHelper project with Visual Studio 2005.

Add a reference to the msmgdsrv.dll (Microsoft.AnalysisServices.AdomdServer.dll) class library. This class library is available in the location where Analysis Services 2005 is installed. By default, this location has the form: C:\Program Files\Microsoft SQL Server\MSSQL.2\OLAP\bin. Please note that, depending on your installation options, MSSQL.2 might be MSSQL.1, MSSQL.3, and so on.

Build the project

Deploy the stored procedure on
the server:

Open SQL
Server Management Studio

Connect to
the target Analysis Services server instance

In the Object
Explorer, select the 'Assemblies' node at the server level

Right-click
and select 'New Assembly'

For the File Name field, use the Browse (...) button to navigate to the location where the stored procedure was built. A file named RSquareHelper.dll should be available at that location, under the bin\debug or bin\release folder. Select that file

Assembly Name
should be automatically filled with RSquareHelper

Set the
permission of this assembly to unrestricted.

Select OK and
deploy the stored procedure

To
call the stored procedure, you can execute the following command in SQL Server
Management Studio:
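The exact procedure name depends on how it is defined in the RSquareHelper project; a call of the following shape (the procedure name GetRSquared and the model name are hypothetical placeholders) illustrates the syntax:

CALL RSquareHelper.GetRSquared('YourModelName')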

Step 2:

Now, go ahead and deploy the "RSquareLinearRegression"
Analysis Services project. This project creates a simple linear
regression model with two integer attributes x and y, where x is used as input
and y as output.

Step 3:

Finally, you can build and run the C# application CalculateRSquare in
the package. This application demonstrates how to call the stored procedure to calculate the R-Squared metric.

The
following figure shows the interface of this application:

From
the interface, you can specify a server name and press the Connect to server
button to establish the connection to the Analysis Services server. The
application will display a list of valid catalogs, models and target attributes.
After selecting the desired model and target attribute, click the Calculate
R-Squared button to check the result. The application calls the stored
procedure to calculate the statistics. After fetching the result,
it shows the statistics in the text boxes at the bottom of the screen.

This tip shows you how to get all the pieces you need to try out the new SQL Server 2005 DM Add-Ins for Office 2007.

Try out the SQL Server 2005 DM Add-ins for Office 2007

So, the SQL Server 2005 Data Mining Add-ins for Office 2007 are
finally available for *free* public download and you can't wait to check them
out. However, you haven't purchased a copy of Office 2007 (yet) and you don't
have access to an installed instance of SQL Server 2005 Analysis Services. How
then do you go about playing with the add-ins? Let's walk you through the steps:

Download and install the
add-ins. Only the
Table Analysis Tools add-in is installed by default so make sure you select
all components during setup if you would like to test-drive all three
add-ins (there is no reason not to).

Follow the Getting Started wizard (which launches the
first time you run Excel 2007 or Visio 2007 after installing the SQL Server
2005 DM Add-ins) to configure your Analysis Services instance correctly for
using the add-ins. If you miss it for some reason when starting Excel 2007
or Visio 2007, the Getting Started wizard is available from the Help
button under the Data Mining tab (ribbon) in Excel 2007 as well as from
Start menu -> Programs -> Microsoft SQL Server 2005 DM Add-ins.

Open up the sample Excel workbook provided with the add-ins
(Start menu -> Programs -> Microsoft SQL Server 2005 DM Add-ins -> Sample
Excel Data), click within the table on the Table Analysis Tools Sample
sheet and select the Analyze tab under Table Tools to see the
Table Analysis Tools ribbon. Or alternatively, you can create your
own table in Excel by selecting a range and clicking the Format as Table
button from the Home tab (ribbon).

You can get more information about the add-ins, including
pointers to additional resources like tutorials
on this page.

This tip explains how to programmatically construct the regression formula on a node from the node distribution content for a regression tree.

Extracting the regression formula for a regression tree node

If you build a
Microsoft_Decision_Trees model with a continuous predictable attribute and one
or more continuous input attributes marked as regressors, the algorithm produces
regression trees with nodes that include a regression formula. You can see this
formula in the mining legend when you select a node in the decision tree viewer.
But how do you make use of this formula in your own app? That is the goal of
this article.

A
little background

Let's provide some
background first though, before we dive into the details.

A
regression tree is a mining model created using the Microsoft Decision Trees
algorithm, where the predictable attribute is continuous. In some cases, one or
more input attributes might also be continuous and can be used as
regressors for the output attribute. This creates decision
tree nodes that include a regression formula, which expresses the output
attribute as a linear function of the input attributes. The regression
formula for an output Y and inputs X1, X2, X3 is expressed in the format:

Y = Ym + m1(X1 – X1m) + m2(X2 – X2m) + …

Where:

Ym: Mean
of Y

Xtm: Mean of Xt

mt: Slope of the
variable Xt with respect to Y

Take the example of
a model created from the AdventureWorksDW sample database
that ships with Analysis Services 2005, where “Age” is the
predictable column (continuous) and “Yearly Income” is an input column
(continuous). When the tree is viewed in the decision tree viewer, the resulting
regression tree has a node that includes the following regression formula:

Age = 33.348+0.00008*(Yearly
Income-51,481.123)

The
formula is interpreted as follows:

·Mean for Age: 33.348

·Mean for Yearly Income: 51,481.123

·Gradient of the line: 0.00008

The
following content query for the node shows the distribution:

SELECT FLATTENED
(
    SELECT
        ATTRIBUTE_NAME,
        ATTRIBUTE_VALUE,
        [SUPPORT],
        [PROBABILITY],
        VALUETYPE
    FROM NODE_DISTRIBUTION
) AS D
FROM [Model].CONTENT
WHERE NODE_UNIQUE_NAME = '<>'

Distribution:

D.ATTRIBUTE_NAME    D.ATTRIBUTE_VALUE        D.SUPPORT    D.PROBABILITY    D.VALUETYPE
Age                 Missing                  0            0.000193536      1
Age                 33.5595353339787         5165         0.999806464      3
Yearly Income       8.4100185198634E-05      0            0                7
Yearly Income       689.231274498812         0            0                8
Yearly Income       51481.1229428848         0            0                9
                    29.2299633602484         0            0                11

The VALUETYPE values are defined as follows:

Missing                  1
Existing                 2
Continuous               3
Discrete                 4
Discretized              5
Boolean                  6
Coefficient              7
ScoreGain                8
RegressorStatistics      9
NodeUniqueName          10
Intercept               11
Other                   12

The mean for Age is calculated as follows:

For each attribute with a coefficient (VALUETYPE=7), multiply the coefficient value by the mean value (VALUETYPE=9) for that attribute, and sum all the products.

51481.1229428848 * 8.4100185198634E-05 = 4.329571974

Add
this to the value of Intercept (VALUETYPE=11)

4.329571974 + 29.2299633602484 = 33.55953533

The mean for each regressor attribute (“Yearly Income” in this case) is the value with VALUETYPE=9 for that attribute.

51481.1229428848

The gradient for each regressor attribute (“Yearly Income” in this case) is the coefficient value (VALUETYPE=7) for that attribute.

8.4100185198634E-05

Using these three values and the sign
(+/-), we can construct the regression formula:
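Substituting these values into the Y = Ym + m(X – Xm) form gives the node’s regression formula as reconstructed from the distribution rows above:

Age = 33.55953533 + 8.4100185198634E-05 * (Yearly Income – 51481.1229428848)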

This tip shows you how to extract a set of rules that indicate a certain product from an Association Rules model

SQL Server Data Mining contains an Association Rules
browser, which displays the most important rules found inside a mining model.
However, for certain pattern exploration tasks, it may be useful to visualize
only those rules that point to a certain item.

An example: for a model* that predicts movie associations, extract
those rules that point to ‘Star Wars’. Even more, extract only rules with a probability
larger than .25 and containing ‘Indiana Jones’ or ‘Blade Runner’ on the left
hand side.

All the rules (and item-sets) are exposed in the model
content, the rowset containing all the patterns detected by the model.

Content nodes that represent rules have a NODE_TYPE value of
8 (Item sets have NODE_TYPE = 7). Inside a rule node, the NODE_DISTRIBUTION
field contains the rule’s right hand side item (with a value type of 2, “Existing”)
as well as the identifier of the node representing the left hand side itemset
(value type of 10, NODE_UNIQUE_NAME).

Therefore, to get the results requested in the beginning, we
need a query that:

- Selects the rule nodes (NODE_TYPE=8)

- Filters only those that contain ‘Star Wars’ in NODE_DISTRIBUTION

- Extracts the support, probability and other information about the rule

- Applies a probability threshold (> 0.25)

- Filters only those that contain ‘Indiana Jones’ or ‘Blade Runner’

- Sorts the results in descending order, by probability

And here is the query:

SELECT TOP 100 * FROM
(
    SELECT FLATTENED
        NODE_CAPTION AS [Rule],
        NODE_PROBABILITY AS [Rule_Probability],
        (SELECT ATTRIBUTE_NAME
         FROM NODE_DISTRIBUTION
         WHERE VALUETYPE = 2 AND
               ATTRIBUTE_NAME = 'Movies(Star Wars)'
        ) AS D
    FROM [MovieClick].CONTENT
    WHERE NODE_TYPE = 8
) AS A
WHERE
    [D.ATTRIBUTE_NAME] <> null AND
    Rule_Probability > 0.25 AND
    (
        VBA!InStr(Rule, 'Blade Runner') > 0 OR
        VBA!InStr(Rule, 'Indiana Jones') > 0
    )
ORDER BY [Rule_Probability] DESC

The result contains all the requested rules, with their
probability.

* The model created for this example
used the MovieClick data available on this site. The algorithm was
Association Rules with the parameters MINIMUM_PROBABILITY set to 0.1 and
MINIMUM_SUPPORT set to 3.

This article describes two stored procedures for the Analysis Server, procedures that compute the Covariance and Correlation matrices

This article describes two stored procedures for the Analysis Server. The first one calculates the covariance matrix between all continuous attributes in the selected data mining model. The second one calculates the correlation matrix between all continuous attributes in the selected data mining model. In both cases it does not matter whether the attributes are inputs or outputs of the model; the stored procedures ignore that distinction.

Preconditions

The stored procedures take a mining model as their only parameter and can be used with any model. They will not work, however, if any of the following is true:

One or more columns contain nested tables

The model does not allow drill through

The model is not processed

Even when these requirements are satisfied, the stored procedures will only return a result for the continuous attributes. If none of the attributes are continuous, the resulting matrix will be empty.

Interpreting Results

The stored procedures return a table with three columns. The first two columns hold the names of the attributes used to calculate the covariance/correlation, and the third holds the computed value. This table is a flattened version of a square matrix of size n x n, where n is the number of attributes. Since covariance/correlation matrices are symmetric, the result only contains one half of the matrix (including the diagonal); this is possible because COV(X, Y) = COV(Y, X), so only one result per pair is needed. The diagonal, COV(X, X), represents the variance of X and is included in the results.

Deploying the stored procedure

The stored procedures are created using C# as a class library project. The project contains only one file, “MatrixRelationships.cs”. To create the library file with the stored procedures, follow these steps:

Note: MatrixRelationships will be substituted with whatever name was selected in Assembly Name.

The list of available models will appear in the DMX query window under Mining Model.

Debugging the stored procedure

To debug the stored procedures, follow these steps:

1. Open the project in Visual Studio

2. Deploy the stored procedures

3. In Visual Studio select Debug->Attach To Process

4. From available processes select msmdsrv.exe

Algorithm description

Covariance Matrix

Covariance matrix algorithm is based on the following formula:
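The formula image is not reproduced here. Based on the algorithm description below (which divides by the total number of rows processed), the computation is presumably the standard population covariance:

COV(X, Y) = (1 / N) * Σ (Xi – Xmean) * (Yi – Ymean)

where N is the number of rows processed.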

The algorithm works as follows:

1. Extract continuous attributes from the model

2. Verify that:

a. the model is processed

b. the model supports drill through

c. the model does not contain nested tables

3. Extract mean and variance for each attribute

4. Obtain the data from the model (all the Xi and Yi)

5. Create a triangular matrix to store the partial results for each X,Y combination

6. For each row in the data set from the model:

a. Update the number of rows processed

b. Obtain the values for each column and subtract the mean

c. Update all combinations of (X,Y) by adding the corresponding product of the centered values, (Xi – Xmean) * (Yi – Ymean)

d. If the value of a column is null for a given row, a 0.0 will be inserted (no covariance)

7. Divide every combination of (X,Y) in the triangular matrix by the total number of rows processed

8. Flatten the triangular matrix into a 3xM matrix, where M is the number of continuous attribute combinations, given by M = n(n+1)/2, where n is the number of continuous attributes.

Correlation Matrix

Correlation matrix algorithm is based on the Pearson product-moment coefficient which is given by the following formula:
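The formula image is not reproduced here; the Pearson product-moment coefficient for two attributes X and Y is:

CORR(X, Y) = COV(X, Y) / (StdDev(X) * StdDev(Y))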

The algorithm for the correlation is the same as for the covariance, except that it additionally divides every (X,Y) combination in the triangular matrix by the product of the standard deviations of X and Y, where the standard deviation is the square root of the variance.

This sample shows the results of rendering data mining viewers with Visio 2007 and saving as a web page

The SQL Server 2005 Data Mining Add-ins for Office 2007 include a delightful model rendering template for Visio 2007. With this template, you can render your graphical views from SQL Server Data Mining as Visio diagrams for annotation, enhancement, and presentation.

The Data Mining Template for Visio 2007 allows you to render decision trees, regression trees, cluster diagrams, and dependency nets. While most of the behavior of the built-in viewers is preserved in the Visio environment, you also get additional functionality such as the ability to move sub-trees to new pages, grow dependency networks starting at an arbitrary node, or create a cluster diagram using discrimination charts.

Once rendered, you can use the power of Visio to annotate your diagrams with text, add supplemental graphics supporting your model, and even apply color themes that match your presentation needs. Finally, the models can be presented as Visio documents, embedded into other Office documents, or simply saved as web pages.

The links here are to models saved as web pages as examples of some of the possibilities. New links will be added as they are created.

This tip shows you how to prepare training and test data sets for mining models with nested tables.

Sampling Nested Tables

SQL
Server Data Mining is unique in its ability to mine multiple tables of data
simultaneously through the nested table feature. This functionality,
however, adds some complexity when performing basic data manipulation tasks
required for data mining, such as sampling. How does one split a data set
with nested tables into training and testing sets? Well, this is how.

SQL
Server 2005 Integration Services (SSIS) provides all the tools necessary in its
data pipeline to sample data and sample nested data. The basic idea is
that you need to simultaneously read from your case data and nested data, sort
them*, sample the case data, and merge in the nested data. Such a data
flow is shown below.

To create
this package, follow these steps:

1. Create a new SSIS package.

2. Add a Data Flow task for your pipeline.

3. Add data sources for your case and nested tables.

4. Use a Sort transform* to sort all tables by the case identifier. This should be the key of the case table, and the foreign key of the nested tables that relates them to the case table.

5. Sample the sorted case table using a Percentage Sampling transform, and label the outputs “Testing” and “Training”. A rule of thumb is to use 70% of the data for training and 30% for testing.

6. Use the Multicast transform on each of the outputs of the nested sorting and the case sampling (both training and testing).

At this point (assuming a single nested table) you will have six data streams. For convenience, we will call these Training1, Training2, Testing1, Testing2, Nested1, and Nested2.

7. Output Training1 and Testing1 into destination transforms – these will be your training and testing case tables. In the diagram above, the destinations are Row Count transforms for illustrative purposes. In practice, you would use a table destination such as the SQL Server Destination transform.

8. Use a Merge Join transform to join Training2 and Nested1. Join the streams using the foreign key, and select all of the columns of Nested1 as output.

9. Output the result of the Merge Join to a destination. This will be your training nested table.

10. Use a Merge Join transform to join Testing2 and Nested2. Join the streams using the foreign key, and select all of the columns of Nested2 as output.

11. Output the result of the Merge Join to a destination. This will be your testing nested table.

Once you
have completed the package, run it to perform the sampling and accurately
divide your data into training and testing sets. You can then add these
tables to a Data Source View in an Analysis Services project for data mining.

The
package described above and supporting data is available for
download.

* Sorting

If you are reading
from a sorted data source, you can take advantage of this fact and omit the
sorting step. In general, your relational database will be much better at
sorting than SSIS, so you can use an ORDER BY clause in your source data query
to take advantage of this. Once you have a sorted data source, however,
you need to inform SSIS what the sort keys are. To do this, follow these
steps:

This tip shows you how to execute and use the results of prediction queries in multiple ways within the SQL Server relational database engine.

Executing Predictions from the Relational Server

A common
request from the relational database gurus in the SQL Server Data Mining community
is to execute predictions from the SQL Server relational database engine – either
in batch mode or ‘on-the-fly’ while rows are being inserted into a table. In
this article, we’ll start with the basics of executing DMX queries in T-SQL and
walk you all the way to predicting values in real-time during the INSERT
operation.

- The mining model is intended to classify Iris flowers based on their sepal/petal dimensions and has 5 columns:

o Petal Length (double, input)

o Petal Width (double, input)

o Sepal Length (double, input)

o Sepal Width (double, input)

o Class (text, predictable)

Let's now
go through the basic steps for setting up prediction query execution from T-SQL.

Step 1: Create a linked
server

The first
step is to create a linked server inside SQL Server pointing to the Analysis
Services instance. The linked server object is created with a statement like
below:

EXEC master.dbo.sp_addlinkedserver
    @server = N'DMServer',
    @srvproduct = N'Analysis Services 2005',
    @provider = N'MSOLAP',
    @datasrc = N'localhost',
    @catalog = N'TestDB'
GO

Upon
executing this statement, a linked server named DMServer
is registered on the database server. The linked server uses the MSOLAP OLE DB
provider to connect to the “localhost” instance of
Analysis Services.

SQL
Server 2005 Management Studio provides a friendly UI that allows you to create
a linked server and to set various linked server options, such as query
timeouts, impersonation rules and others. This is accessible under Server
Objects -> Linked Servers -> New Linked Server …

Step 2: Execute DMX
queries from T-SQL using OPENQUERY

Once the
linked server is created, it can be used for data mining (DMX) queries like this:

SELECT * FROM OPENQUERY(DMServer,
    'SELECT node_caption, node_type FROM iris.content')

A
potential use of such queries is to use data mining to assign predicted labels
to all rows in a table and then fetch the rows together with the new label and
save them inside the relational server. Such a query typically looks like
below:

INSERT INTO QueryResultsTable SELECT * FROM OPENQUERY(DMServer,
    'SELECT … FROM Model
PREDICTION JOIN OPENQUERY…')

Executing prediction
queries from inside a trigger

Now that
we know how to execute basic data mining queries from the SQL Server 2005
relational server, let's look at the more advanced “on-the-fly” scenario.

For this
purpose, we’ll build a trigger that is launched whenever a row is added to a
certain table. The trigger will execute a data mining prediction and append the
prediction result to the newly inserted row.

Trigger
objects are well documented in the SQL Server 2005 books online. The only
problem for our task is executing the prediction query and using the result.
The problem resides in the fact that:

- The prediction query must usually be parameterized with the values of the other row columns

- The OPENQUERY statement, which is used for data mining queries, cannot take parameters or even string variables as the statement text

To
overcome this problem, we will use the EXEC statement.

We’ll
assume that a relational table named NewIris is
created on the SQL Server instance and the table has a set of columns matching
the mining model, i.e.:

- RowKey int (the identity column)

- PLength float (to be mapped to the model’s Petal Length column)

- PWidth float (to be mapped to the model’s Petal Width column)

- SLength float (to be mapped to the model’s Sepal Length column)

- SWidth float (to be mapped to the model’s Sepal Width column)

- PredictedClass varchar(50) (to keep the prediction result)

The
trigger will be launched on inserting a new row and has to

a) Collect the values for the columns

b) Author a data mining prediction statement for the current row (a singleton prediction)

c) Execute the prediction

Inside
the trigger’s scope, the current row can be identified using the @@identity variable.
Note that this variable is only available when the table has an identity column
(RowKey above). The column values for the current row
should be collected into a set of variables, so that they can be used later to
author the DMX query:

DECLARE
@newPLength varchar(15)

DECLARE @newPWidth
varchar(15)

DECLARE
@newSLength varchar(15)

DECLARE @newSWidth
varchar(15)

-- Select the newly added values

SELECT @newPLength =
dbo.FormatFloatForDMX(PLength),

@newPWidth = dbo.FormatFloatForDMX(PWidth),

@newSLength =
dbo.FormatFloatForDMX(SLength),

@newSWidth =
dbo.FormatFloatForDMX(SWidth)

FROM NewIris WHERE $identity=@@identity

The $identity column represents the key column
of the table.

FormatFloatForDMX is a user-defined function which formats a float number
(or a NULL) so that it can be used inside a DMX statement. The function is
defined as shown below:

CREATE FUNCTION FormatFloatForDMX
(
    @Value float
)
RETURNS varchar(20)
AS
BEGIN
    DECLARE @ResultVar varchar(20)

    SELECT @ResultVar =
        CASE
            WHEN @Value IS NULL
                THEN 'NULL'
            ELSE
                CONVERT(varchar(15), @Value)
        END

    RETURN @ResultVar
END
GO

Now, with
the column values collected, we can author the prediction query.

-- compose the Analysis Services DMX query
DECLARE @DMQuery varchar(256)

SET @DMQuery = 'SELECT [Class] FROM [Iris] NATURAL PREDICTION JOIN ' +
    '(SELECT ' +
    @newPLength + ' AS [Petal Length], ' +
    @newPWidth + ' AS [Petal Width], ' +
    @newSLength + ' AS [Sepal Length], ' +
    @newSWidth + ' AS [Sepal Width] ' +
    ') AS T'

The data
mining prediction query is ready to be executed against Analysis Services. Remember, the OPENQUERY
statement does not allow variables as arguments, so this query cannot be
executed directly. Besides, we need to collect the predicted result and insert
it into the NewIris table in the PredictedClass
column. To achieve this, we will use the SQL EXEC instruction.

'UPDATE
[NewIris] SET PredictedClass=@predictedValue WHERE $identity='+CONVERT(varchar(15),@@IDENTITY)

-- execute the OPENQUERY statement

EXEC(@FullQuery)
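The assembly of @FullQuery was abbreviated above; a minimal sketch of how it might be put together follows. The [Class] column alias returned through OPENQUERY and the varchar sizes are assumptions, so adjust them to your model; the string is then executed with EXEC(@FullQuery) as shown above.

DECLARE @FullQuery varchar(1024)

SET @FullQuery =
    -- declare a variable inside the dynamic batch to hold the prediction
    'DECLARE @predictedValue varchar(50) ' +
    -- run the DMX query through the linked server and capture the predicted class
    'SELECT @predictedValue = [Class] ' +
    'FROM OPENQUERY(DMServer, ''' + @DMQuery + ''') ' +
    -- write the prediction back to the row that was just inserted
    'UPDATE [NewIris] SET PredictedClass = @predictedValue ' +
    'WHERE $identity = ' + CONVERT(varchar(15), @@IDENTITY)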

As you
notice, the SQL statements above declare a string variable, @predictedValue, then populate the
value from the execution of an OPENQUERY statement against the DMServer linked server. After that, it sets the value of
the variable as the PredictedClass column value in
the NewIris table.

The whole set of SQL instructions is grouped in a single string and executed with a single EXEC instruction.

With
this, all the blocks required for defining the trigger are available and the
full trigger definition can be put together as shown below:

This article presents a way of executing multiple DMX statements from inside SQL Server Management Studio.

Analysis Services doesn't support IMultipleResults, so it is not actually possible to execute multiple statements at once. However, SQL Server Management Studio has a way to work around this issue.

If you type in multiple DMX statements separated by semicolons and try to execute them, it won't work. However, usually the reason people want to do this is that they want to script a CREATE MINING MODEL, INSERT INTO, and possibly a SELECT statement together. If you were to use the above method, assuming it worked, it would actually mean that all these statements get executed in parallel - which wouldn't work or be what you wanted anyway. Because you have to create the model before training, and train before selecting, the statements need to be executed serially.

So what you want to do is to use the SQL Management Studio keyword "GO". When you type GO on its own line in SQL Management studio, it doesn't send the "GO" to the server, rather it sends the query up to that point, waits for a return and then sends the next statement serially (up to the next GO or the end) - exactly the behavior you want, and it works as well!
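For instance, a script of the following shape (the model, column, and data source names are illustrative) runs each statement in turn when executed in a DMX query window:

CREATE MINING MODEL [BikeBuyerSample]
(
    [CustomerKey] LONG KEY,
    [Age] LONG CONTINUOUS,
    [Bike Buyer] LONG DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
GO

INSERT INTO [BikeBuyerSample]
    ([CustomerKey], [Age], [Bike Buyer])
OPENQUERY([MyDataSource], 'SELECT CustomerKey, Age, BikeBuyer FROM Customers')
GO

SELECT Predict([Bike Buyer]), PredictProbability([Bike Buyer])
FROM [BikeBuyerSample]
NATURAL PREDICTION JOIN
    (SELECT 35 AS [Age]) AS t
GO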

In a previous
tip, we showed you how to build a stored
procedure that generates a "CREATE MINING MODEL" statement
for any server mining model. This tip extends that
idea to a stored procedure that returns a collection
of statements for re-creating an existing server
mining structure and all its contained mining models
using DMX. This gives you a compact
representation of a mining structure and multiple
models that you might have created iteratively using SQL Server
Business Intelligence Development Studio.

All you have to do after building and deploying the
stored procedure is execute the query

CALL
DMXtract.CreateMiningStructureStatements("MiningStructureName")

The
CreateMiningStructureStatements
stored procedure
returns a multi-row result set with a row for each
creation statement - the first statement creates the
mining structure itself and each of the subsequent
statements alter the mining structure and add a
mining model to it. The output looks like this:

Below is the code for
CreateMiningStructureStatements.

[Note: To build this code you will have to add a
reference to Microsoft.AnalysisServices.AdomdServer.
For more information on how to deploy stored
procedures see
this article in Books Online.]

This tip shows how to trace and display the progress of mining model processing with AMO

Tracing Mining Model Processing Using AMO

SQL Server 2005 Analysis Services gives you the ability to
create and process mining structures and models programmatically in your own
.NET application using the AMO object model. Since processing mining models can be time consuming on large data
sets, it would be great to show progress information similar to what is
displayed by Business Intelligence Development Studio and SQL Server Management
Studio.

This article and the
accompanying sample app will show you how to use
the AMO trace object to obtain progress
information and also demonstrate how to display the information in a RichTextBox
control.

This tip shows you how to restrict your associative prediction query results to only those values that are based on rules rather than including popular predictions as well.

SQL Server Data Mining has a powerful associative prediction model, allowing for easy creation of predictions based on, for example, shopping cart contents. These predictions can easily be implemented using the PredictAssociation function on a model*, like this:

SELECT
    PredictAssociation([MovieModel].[Movies], 5)
FROM
    [MovieModel]
NATURAL PREDICTION JOIN
    (SELECT (SELECT '28 Days' AS [Movie]) AS [Movies]) AS t

The statement above will always generate five predictions for the movie ’28 Days’. In this case the results are:

Now say for example, the user adds the movie “What’s New, Pussycat” to their basket. You generate a new query like this:

SELECT
    PredictAssociation([MovieModel].[Movies], 5)
FROM
    [MovieModel]
NATURAL PREDICTION JOIN
    (SELECT (SELECT 'What''s New, Pussycat' AS [Movie]
             UNION SELECT '28 Days' AS [Movie]) AS [Movies]) AS t

You fetch the results and, amazingly, you get exactly the same results as before! How is this possible? Well, let’s explain. The associative prediction engine in SQL Server Data Mining assumes that if you ask for five predictions, you want to show five predictions. Whenever the engine has no rules based on the input, or not enough results based on the input, it pads the results with the most popular items. In this case, the model had no rules for either “28 Days” or “What’s New, Pussycat”, so it returned the top results selected by anyone.

In many cases this behavior is perfectly acceptable. For example, if you have an empty basket, you still want some recommendations – and the most popular items are a good bet. The same may be true for rare items that nobody else purchased – you may still want to show something, so this list is better than no list.

However, in other circumstances, you may only want to show items that are definitely correlated to the input – if for no other reason, just so you know whether the suggestions are actually based on the basket contents. In this case you can use a feature called the “PredictNodeId” to filter out the padded results.

The PredictNodeId represents the piece of mining model content that supports any particular prediction. If you were to look up the node id in the model’s content you can determine why any particular prediction was made. It turns out, if the prediction was made simply due to an item’s popularity and not based on any learned information, that PredictNodeId will be empty. Therefore, we can use a parameter of PredictAssociation called INCLUDE_NODE_ID, and filter out predictions without rules like this:
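A sketch of such a query follows. The exact shape of the nested-table sub-select and the [$NodeId] column name are written from memory rather than taken from the original article, so treat this as a starting point:

SELECT
    (SELECT *
     FROM PredictAssociation([MovieModel].[Movies], 5, INCLUDE_NODE_ID)
     WHERE [$NodeId] <> '')
FROM
    [MovieModel]
NATURAL PREDICTION JOIN
    (SELECT (SELECT 'What''s New, Pussycat' AS [Movie]
             UNION SELECT '28 Days' AS [Movie]) AS [Movies]) AS t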

Voila! Now the query above returns no results, because the model has no rules based on either input movie. Likewise, the following query, for the movie “1941”, returns one result because a rule does exist for it.

To show that it is still possible for the “top five” to appear in a recommendations list, you can simply add “Star Wars” to the above query and you will see that “The Empire Strikes Back” – a top five movie – is returned in addition to the result generated by “1941”. If you want to find out what the actual rules are for each prediction see this tip for an explanation.

* The model created for this example used the MovieClick data available on this site. The algorithm was Association Rules with the parameters MINIMUM_PROBABILITY set to 0.1 and MINIMUM_SUPPORT set to 3.

Disclaimer: SQLServerDataMining.com is currently managed by members of the SQL Server Data Mining development team at Microsoft Corporation. It does not represent Microsoft’s official position on its products or technologies. All content is provided “AS-IS” with no warranties, and confers no rights.