bogdanc on SQL Server Data Mining

Entries Tagged as ''

You may have noticed, our community site has changed. Besides better graphics (including a new and improved question mark for the forums link), it looks better and, more important, it should run much better. The hardware received a much needed upgrade and the community platform was changed from the old ASP.Net Community Starter Kitt to the newer, more scalable, DotNetNuke.

More important — the site now runs SQL Server 2008 (the November CTP). You should notice a difference in the performance of the Live Samples .

The new framework uses a new link system. We tried to make sure that all the old links still work (directly or via a redirector). However, if you find a broken link on the new site, please let us know by sending a mail to the DM Team .

It includes many really cool new features in Analysis Services. Among them: Holdout support, Model filtering, DMX Column aliasing, Drillthrough enhancements, Cross validation and practically two new forecasting algorithms under the Microsoft_Time_Series umbrella.
I intend to present all of them briefly, and I start today with the Holdout support.

Most of the data mining tasks require a validation step, performed right after modeling. This validation step
consist (typically) in evaluating model’s performance against data that was not seen during training (test data).

The test data ideally has the same statistical properties as the training data (data seen by the mining model during training). An easy way to achieve statistical similarity between the training and test set is to use random sampling. This method is not guaranteed to give correct results (statistical similar populations) but, assuming that the random sampling mechanism is independent of the data, it will work in most common scenarios.

SQL Server Integration Services has a Random Sample transform, which extracts a random sample (with a certain percentage) using a mechanism independent of the actual data being samples. This is why we strongly recommended using Integration Services to generate test/training partitions for SQL Server Data Mining.

However, there are a few problems:
- Integration Services will have to save at least one of the sample sets in a relational table (or some form of output destination)
- This sampling method can only be applied to data coming from a relational source (or, in general, a source that can be used with IS). That means it is difficult to use IS sampling with application that do data mining on in-memory data
- Integration Services is rather hard to use to sample data for models with nested tables. It can be done, but it takes around 11 simple steps and 14 transforms to do this for a single nested table (an example is available here: Sampling Nested Tables )

Now, there is a simpler way to do this. You may remember that, in the SQL Server Data mining architecture, a Mining Structure object acts as a data space while a mining model is a problem to be addressed in that data space. As the mining structure describes the data space, it is natural for the structure to partition the data into training and testing sets.

In SQL Server 2008, the new wizard for creating a model (or structure — yes, there is a wizard now for creating a structure with no models!) allows specifying the desired size of the “Holdout” dataset — that is, data to be stored in the mining structure for testing purposes, without being available for models training. By default, the holdout size is 30% (leaving 70% for training). You may choose to specify a fixed number of rows instead of a percentage, or both (in this case the semantic is “use 30%, but no more than 10000 rows for test data”).

The rest of this post shows how to express the holdout wizardry in DMX.

In the previous post I presented a not-so-documented cell function installed in Excel by the data mining add-ins. Jamie also gave some cool demos featuring these functions. To complete this topic, here is a full description of the cell functions and how they can be used.

The Excel Data Mining Client add-in includes 3 cell functions: DMPREDICT, DMPREDICTTABLEROW and DMCONTENTQUERY. These functions are included in the add-ins for SQL Server 2005, but not documented or supported. However, they will be both documented and supported in the add-ins for SQL Server 2008.

To use these functions, you will need a mining model available on the server. For the examples below, I created a Microsoft Decision Trees mining model, built by running the Classify task in the DM Client add-in over the sample Table Analysis Tools Sample spreadsheet in the sample workbook installed by the add-ins.

A recent post on the MSDN Forums raised an interesting issue: Excel Data Analysis’s Linear Regression and SSDM were returning different results. Specifically, the SSDM results were much worse.

The issue turned out to be a data modeling issue (columns were not mapped properly). However, during the investigation I had to compare Excel’s linear regression with the SSDM regression algorithm(s). Thought this might be interesting, so here is one way to compare the results from the two implementations.

I started with some simple (X,Y) data (available for download as a CSV file). First step - run Excel’s Data Analysis regression tool. The results are displayed typically in a separate spreadsheet, and the interesting part is the Coefficients sections:

Next thing — apply Excel’s regression coefficients to the existing data. I did this by adding a column in my data spreadsheet and populating it with a formula:

Next thing, I created a Microsoft Linear Regression mining model on the same data. There is a variety of ways to do this, such as exporting data to a table, connecting directly to the Excel spreadsheet or, the simplest way, by using the Excel add-ins.

To get the model’s predictions in Excel, I used one of the functions exposed by the Excel add-ins, DMPREDICT. If you do not have the add-ins you can always execute a prediction query in SQL Server Management Studio or BI Dev Studio.
However, with the add-ins’s function, getting the prediction results is really easy:

You found it! Good!
Welcome to this page. I’m Bogdan Crivat, a developer in the SQL Server Data Mining team at Microsoft, and this blog will cover mostly data mining topics (ranging from interesting modeling problems up to - or down to - algorithm implementation issues )