
Question about "Clustering Model"

Question

I've started learning Microsoft Analysis Services and don't understand why the "State" on the "Cluster Diagram" tab has negative values. In the dataset I can see only positive values...

P.S. Could you suggest some books or other sources for better understanding the Microsoft data mining tools, preferably with simple examples? I've already gone through the tutorials in BOL, but there are still some things I can't understand.

Answers

This happens because the column has a mean value close to 0 and a very large standard deviation.

When the viewer needs to discretize the values in a numeric column into 5 bins, it calculates the bin ranges from the mean and standard deviation, assuming that the values follow a normal distribution.
That is why it creates bins with negative boundaries.
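To see how this can happen, here is a minimal sketch in Python. The exact formula the viewer uses is not given in this thread, so the edge positions (mean plus or minus 0.5 and 1.5 standard deviations), the function name, and the sample values are all assumptions for illustration only.

```python
import statistics


def normal_bin_edges(values):
    """Return four interior edges that split a column into 5 bins.

    Assumption: edges sit at mean -1.5, -0.5, +0.5 and +1.5 standard
    deviations. The real viewer may use different multipliers.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [mean + k * std for k in (-1.5, -0.5, 0.5, 1.5)]


# Hypothetical column: every value is positive, but two are extreme,
# so the standard deviation is large compared with the mean.
billed = [1, 2, 2, 3, 3, 4, 5, 5, 900, 1500]

edges = normal_bin_edges(billed)
print(edges)

# The data has no negative values, yet the lowest edge is below zero.
print(min(billed) > 0, edges[0] < 0)
```

The lower edges fall below zero because they are derived from the fitted normal curve, not from the values that are actually present.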

The correct viewer behavior would be to look at the actual minimum and maximum values in the column, in addition to the mean and standard deviation, when determining bin ranges. If anybody from Microsoft is looking at this thread, please open a bug to fix this.
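As a rough sketch of what that could look like (this is not the viewer's code; the multipliers, the function name, and the sample data are assumptions), the edges computed from the mean and standard deviation can be clamped to the observed minimum and maximum:

```python
import statistics


def clamped_bin_edges(values, multipliers=(-1.5, -0.5, 0.5, 1.5)):
    """Compute bin edges from mean/std, then clamp them to the data range.

    Assumption: the multipliers are illustrative, not the viewer's own.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    lowest, highest = min(values), max(values)
    # Pull any edge outside [lowest, highest] back onto the boundary.
    return [min(max(mean + k * std, lowest), highest) for k in multipliers]


# Hypothetical all-positive column with two extreme values.
sample = [1, 2, 2, 3, 3, 4, 5, 5, 900, 1500]
clamped = clamped_bin_edges(sample)
print(clamped)
```

Note that clamping can collapse several edges onto the same value, which leaves some bins empty. A real fix would probably also need to merge or recompute such bins.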

You should consider preparing the data before modeling to get a more accurate model. Possible data preparation steps are:

replace extreme values with values closer to usual range

remove rows with extreme values

create calculated columns that use original columns and use them instead of the original columns
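The three preparation steps above could be sketched in Python as follows. The column name "billed", the thresholds, and the choice of a log transform for the calculated column are hypothetical; appropriate values depend on your data.

```python
import math


def clip_extremes(values, low, high):
    """Replace values outside [low, high] with the nearest boundary."""
    return [min(max(v, low), high) for v in values]


def drop_extremes(rows, column, low, high):
    """Keep only the rows whose value in `column` lies within [low, high]."""
    return [r for r in rows if low <= r[column] <= high]


def add_log_column(rows, column, new_column):
    """Add a calculated column holding log(1 + value) of an existing column."""
    return [{**r, new_column: math.log1p(r[column])} for r in rows]


# Hypothetical data: mostly small values plus two extreme ones.
rows = [{"billed": v} for v in [1, 2, 3, 4, 5, 900, 1500]]

clipped = clip_extremes([r["billed"] for r in rows], 0, 100)
kept = drop_extremes(rows, "billed", 0, 100)
logged = add_log_column(rows, "billed", "log_billed")

print(clipped)
print(len(kept))
print(logged[0])
```

Clipping keeps every row but flattens the outliers, dropping rows loses data, and a log transform compresses the range while preserving order. Which one is suitable depends on whether the extreme values are errors or real observations.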

Can you try changing the Date and Number Format to English (US), rebooting the computer, and retraining the model? I want to check this because I have seen errors caused by European countries using "," instead of "." as the decimal separator.
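To illustrate why the decimal separator matters, here is a small Python sketch. It is not how SQL Server parses numbers; the helper name and the sample string are made up for the example. It only shows that the same text can yield very different values under the two conventions.

```python
def parse_number(text, decimal_sep=".", group_sep=","):
    """Parse a numeric string using the given separators (hypothetical helper)."""
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


value = "1234,56"

# US convention: "," groups thousands, so the comma is simply dropped.
as_us = parse_number(value, decimal_sep=".", group_sep=",")

# European convention: "," is the decimal separator.
as_eu = parse_number(value, decimal_sep=",", group_sep=".")

print(as_us, as_eu)
```

If values were misread this way at any stage, the mean and standard deviation of the column would be distorted, which is why ruling out the regional settings is a reasonable first check.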

In your entire dataset, what is the minimum value in the "Billed for9 Month" column?

It looks like a bug in SQL Server to me. If you can find or create data that causes the same problem when you create a clustering model and email it to me, I might be able to investigate and provide a workaround. My email is tatyana AT predixionsoftware DOT com.

