bogdanc on SQL Server Data Mining

Entries Tagged as 'Cloud'

If you happen to be at the SQL PASS conference this week in Seattle, you will find us presenting the new version of the Predixion Insight package. We added rich collaborative features and completely new visualizations.

One feature that I love in this release is the Data Profiling feature. With a couple of clicks, this feature will compute common statistics for all columns in your PowerPivot or Excel dataset, and render a report allowing you to explore these statistics.

Like all other Predixion Insight tasks, this allows you to select an Excel range or table or a Power Pivot table as your data source for profiling.

The second pane of the wizard allows one to select the type of statistics to be collected: Basic or Advanced. Advanced is a bit slower (think milliseconds to seconds), but so much more useful.

The Data Profiling result is rendered in a new Excel sheet, with a few sections:

Data Summary

the first section, Data Summary, tells you what kind of columns were detected in your data and it looks like this:

CustomerID is identified as a likely key (a unique row identifier), so statistics will not be collected for this column

Continuous statistics

This section contains information about numeric columns. Here are the profile columns generated under Advanced analysis:

- # - the ordinal of the column in the source table. Useful if, after sorting by any other measure, one wants to restore the table to the original layout

- Column Name

- “Looks Like” – the result of the Predixion Insight heuristic analysis of the column. A column may be regarded as Interval (a range of values), Multinomial (a few distinct value, hence categorical or discrete), Binomial (2 values) or Constant (a single value)

- Count of Blanks, Minimum, Maximum, Mean, Sample Variance, Standard Deviation, Range, Kurtosis, Skewness, Standard Error of Mean, Approximate Median, Mode, the 95% confidence interval (assuming normal distribution). If you feel rusty about any of these metrics, just click on the column name and your browser will give you a friendly refresher

Discrete Statistics

Ordinal, Column Name, “Looks Like” and Count of Blanks appear again, just like in Continuous Statistics. The number of distinct states is added for discrete columns.

Other columns:

- Top 80% states – number of distinct states covering 80% of the data

- The Top 3 most/least popular values, with their respective counts

The report can be filtered by column name. You may notice that certain columns appear both in the Continuous and the Discrete section of the report. Such columns may have a numeric type but a distribution which suggests they should be regarded as categorical. The Data Profile report should help you decide whether to use Classify or Estimate for such columns. (If in doubt, this may help: No model for Constant, Estimate for Interval, Classify for all others)

Correlation/Covariance

For numeric columns, here are the correlation and covariance matrices. While not very visible in this dataset, the correlation matrix uses a heat map from 1 (green, perfect positive correlation) to –1 (red, perfect negative correlation). White cells show no correlation (a value of 0). The Correlation matrix can be sorted by correlation for each individual column – sorting it by Column Index, #, will restore the shape of the matrix

A new SQL Server Data Mining tool is available now for you to use directly from the browser — the Prediction Calculator.

The feature (already present in the 2008 version of the Data Mining Add-ins) is a scorecard generator. The tool learns from your data and produces a simple calculator which can be used to compute a score (between 1 and 1000), score which describes the likelihood of a certain event. If you are not familiar with this kind of tools, think of the personality tests that appear in various magazines or about the FICO score.

When you launch the Prediction Calculator, you need to select a target column and a target value (or range). If you target column has categorical values, then you will need to specify exactly your target value. Example: if you want to predict whether some computer will crash in the next month, you will need to select, say, the “Yes” value of the “Will Crash in the Next Month” column. If your target is numeric, you can specify a range. For example, you can use the prediction calculator to figure out how likely it is that your insurance costs will grow between 20% and 50% in one year.

The result, depending on the columns in your data set, looks more or less like below. As you select different values for various columns, the score goes up or down. If it exceeds a certain threshold then you get a positive prediction, otherwise a negative prediction.

For example, try to play with the calculator below to figure out whether you are a likely Bike Buyer or not:

How it works: each of the attributes gets a score, depending on the state. For example, having 0 cars may score 100 points, while having 2 cars may score 0 points (people with many cars are less likely to purchase a bike). The points are added up and, if the total score (always between 0 and 1000) exceeds a certain threshold, then the prediction is positive, otherwise negative.

The Prediction Calculator tool allows you to determine the threshold that maximizes your “profit”. In the lower part of the prediction calculator result (not visible in the embedded snapshot above) you can specify the costs associated with a False Positive (Type I) or False Negative (Type II) error. You can also specify any profits deriving from a correct positive or negative prediction.

The tool computes the total profit generated by various score thresholds based on your inputs and produces a set of diagrams, one for profit and one for cumulative costs, like below:

You can see in the profit diagram that the profit is maximized for a score around 450 or 500. The costs are also minimized in that area.

The tool detects the scorecard threshold that optimizes the profit (and includes it in the page).

Once you produce and tune a prediction calculator, you can embed it in your HTML page, just copy and paste the HTML fragment — the Bike Buyer calculator above is such a fragment.

This week, at the KDD (Knowledge Discovery and Data Mining) conference, we (as in Microsoft SQL Server Data Mining team) presented the Table Analysis Tools for the Cloud, a preview for a technology that enables anybody to play with some of the Microsoft’s data mining tools, without any bulky downloads and with zero configuration effort.

Around May this year I practically entered some sort of sabbatical: 3 months to work on an incubation project of my choice (yes, the Microsoft SQL Server organization does this kind of things! if it sounds appealing, check out our recruiting site or, even better, contact directly our SQL Server Data Mining recruiter, Melsa Clarke - melsac ATmicrosoft DOT com). With some help from Jamie, various management levels and some nice guys in the SQL Server Data Services team, I gathered the infrastructure for a Software as a Service incubation and set up a web incarnation of the Table Analysis Tools add-in for Excel.

Now it is up and running. The entry page is at sqlserverdatamining.com/cloud, so if I got you bored already and you don’t want to read the rest of this stuff, go ahead and browse that page.

What it is

TAT Cloud is a set of canned data mining tasks that you can use without having SQL Server installed on your machine. It consists of encapsulations of some common data mining problems, such as detecting key influencers, forecasting, generating predictive scorecards or doing market basket analysis. The tasks can be executed directly from your browser: just go to the web page, upload your data (in CSV) format and run a tool from the toolbar. Even better, the tasks can be executed directly from Excel. For this, however, you will need to have Excel 2007 and install an add-in which can be downloaded from here.

All tasks work on a table (Excel table or a table in CSV format that you upload to the web interface). All tasks produce reports that can be used to learn more about the analyzed data.

Here is a complete list of features:

- Analyze Key Influencers: it detects the columns that impact your target column. It presents a report of those values in other columns that correlate strongly with values in your target column.

- Detect Categories (clustering, for data miners) — identifies groups of table rows that share similar characteristics. A categories report is generated, which details the characteristics of each category

- Fill From Example — to some extent, similar to Excel’s Autofill feature: it learns from a few examples and extends the learned patterns to the remaining rows in the table

- Forecasting — analyzes vertical series of numeric data, detects periodicity, trends and correlations between series and produces a forecast for those series

How it works

Your data (CSV file or Excel spreadsheet) is uploaded to the web service site. There, Analysis Services crunches it and produces the reports you get either in the browser or in a different spreadsheet. The data and the mining model used for analyzing it are deleted immediately after processing. If something bad happens and your session does not conclude successfully (fancy wording for "if it crashes") then both data and models are removed automatically after 15-20 minutes.

You will notice in the Excel add-in (as well as in the web interface) a strange IP address, 131.107.181.99 — this is IP of the service. It will be changed to something cleaner very soon.

Excel uses HTTPS to connect to the service, in an effort to protect your data. However, you should not use this technology on sensitive data.

What it is not

This is not an official shipping Microsoft technology. It is actually less than even a beta. It may crash, it may produce incorrect results, it may be shot down at any time. BTW, I would appreciate if you posted note here or on the Microsoft data mining forums if this happens (particularly about crashes, I should probably know if it gets shut down)!

What next

Well, try the tools: sqlserverdatamining.com/cloud . I recommend downloading the Excel 2007 add-in, rather than using the web application, as it is more functional. Post any questions you might have on the Microsoft data mining forums. And check out this blog periodically for announcements (typically new additions to the web interface) or for more details on how this stuff works