bogdanc on SQL Server Data Mining

Entries Tagged as ''

A new SQL Server Data Mining tool is available now for you to use directly from the browser — the Prediction Calculator.

The feature (already present in the 2008 version of the Data Mining Add-ins) is a scorecard generator. The tool learns from your data and produces a simple calculator which can be used to compute a score (between 1 and 1000), score which describes the likelihood of a certain event. If you are not familiar with this kind of tools, think of the personality tests that appear in various magazines or about the FICO score.

When you launch the Prediction Calculator, you need to select a target column and a target value (or range). If you target column has categorical values, then you will need to specify exactly your target value. Example: if you want to predict whether some computer will crash in the next month, you will need to select, say, the “Yes” value of the “Will Crash in the Next Month” column. If your target is numeric, you can specify a range. For example, you can use the prediction calculator to figure out how likely it is that your insurance costs will grow between 20% and 50% in one year.

The result, depending on the columns in your data set, looks more or less like below. As you select different values for various columns, the score goes up or down. If it exceeds a certain threshold then you get a positive prediction, otherwise a negative prediction.

For example, try to play with the calculator below to figure out whether you are a likely Bike Buyer or not:

How it works: each of the attributes gets a score, depending on the state. For example, having 0 cars may score 100 points, while having 2 cars may score 0 points (people with many cars are less likely to purchase a bike). The points are added up and, if the total score (always between 0 and 1000) exceeds a certain threshold, then the prediction is positive, otherwise negative.

The Prediction Calculator tool allows you to determine the threshold that maximizes your “profit”. In the lower part of the prediction calculator result (not visible in the embedded snapshot above) you can specify the costs associated with a False Positive (Type I) or False Negative (Type II) error. You can also specify any profits deriving from a correct positive or negative prediction.

The tool computes the total profit generated by various score thresholds based on your inputs and produces a set of diagrams, one for profit and one for cumulative costs, like below:

You can see in the profit diagram that the profit is maximized for a score around 450 or 500. The costs are also minimized in that area.

The tool detects the scorecard threshold that optimizes the profit (and includes it in the page).

Once you produce and tune a prediction calculator, you can embed it in your HTML page, just copy and paste the HTML fragment — the Bike Buyer calculator above is such a fragment.