Monday, February 19, 2007

Predictive Modeling and Microsoft Analysis Services 2005

I have been using this product for six months now, and I attended a three-day training at Microsoft led by Jamie (thanks!). It is a good product, and Microsoft has done an excellent job of bringing data mining to the "masses".

This product is scalable (we are running it against more than seven terabytes of data every month) and user friendly. It also integrates fairly easily with Reporting Services.

The key to using Analysis Services for a supervised model is the training sample. My main recommendation is to bring all of your data tags into the training sample. To determine the size of the training sample population, multiply the number of data tags by five; your data tags will then represent 20% of that population.
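The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is only an illustration, assuming that "data tags" means the tagged (labeled) records; the function names and the example count of 2,000 are hypothetical and not part of Analysis Services.

```python
def training_sample_size(num_tagged: int, multiplier: int = 5) -> int:
    """Total training sample size under the 'multiply the tags by five' rule."""
    if num_tagged <= 0:
        raise ValueError("num_tagged must be positive")
    return num_tagged * multiplier


def tagged_share(num_tagged: int, sample_size: int) -> float:
    """Fraction of the training sample made up of tagged records."""
    return num_tagged / sample_size


# Hypothetical example: 2,000 tagged records.
size = training_sample_size(2000)   # 2,000 tagged + 8,000 untagged records
share = tagged_share(2000, size)    # tagged records as a fraction of the sample
print(size, share)
```

The remaining records (four for every tagged one) would be drawn from the untagged part of the population.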

Another key issue is tuning the algorithm parameters, specifically the maximum number of states. To determine the maximum number of states in your data, I suggest a combination of partition and distribution analyses. You can also use the Microsoft Decision Trees algorithm.
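One simple form of distribution analysis is to count how many distinct states an input needs in order to cover most of the rows. The sketch below is an assumption about what such an analysis could look like, not the procedure used in Analysis Services; the function name and the coverage thresholds are hypothetical.

```python
from collections import Counter


def states_needed(values, coverage=0.99):
    """Smallest number of most-frequent states that covers `coverage` of the rows."""
    counts = Counter(values)
    total = sum(counts.values())
    running = 0
    for k, (_state, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running / total >= coverage:
            return k
    return len(counts)  # only reached for an empty input (returns 0)


# Hypothetical column with three states of very different frequency.
column = ["a"] * 90 + ["b"] * 8 + ["c"] * 2
print(states_needed(column, coverage=0.95))
print(states_needed(column, coverage=1.0))
```

Running this per input column gives a rough idea of how high the maximum-states setting needs to be before rare states stop being collapsed together.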

David did a great job with the data mining algorithms, but those of us who have been in the data mining industry for a long time need more detailed (and peer-reviewed) articles about the algorithms. For example, the Predict and PredictProbability functions return negative values, which should not be mathematically possible in an unsupervised model. Even if we filter out all the negative inputs, we still get negative output. I think this is a data type issue, but we are still researching it.

Another issue that the algorithms do not address is whether any variable or input is improperly influencing the predictive output. Specifically, I would like the models to report the VIF (variance inflation factor) value for each input. Otherwise, we may find ourselves in one of those situations that are "too good to be true."
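Until the models report it, the VIF can be computed outside Analysis Services on the exported inputs. The sketch below uses the standard identity that the VIF of each input is the corresponding diagonal entry of the inverse of the inputs' correlation matrix. It uses only the Python standard library, assumes numeric inputs with no constant columns, and all function names and the sample data are hypothetical.

```python
import random
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("inputs are perfectly collinear; VIF is infinite")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per input: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Hypothetical data: x3 is nearly a copy of x1, x2 is unrelated.
random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(200)]
x2 = [random.gauss(0, 1) for _ in range(200)]
x3 = [a + 0.1 * random.gauss(0, 1) for a in x1]
print(vif([x1, x2, x3]))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that an input is largely explained by the other inputs; in this made-up data that would flag x1 and x3 but not x2.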

The last issue is that the number of Type II errors is extremely large in these models (when we apply the model built on the training set to the entire population). Specifically, I am referring to Type II error rates greater than 60%!
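For readers who want to measure this on their own models, here is a minimal sketch of a Type II error rate, taken here to mean the share of actual positives that the model predicts as negative (false negatives). That definition, the function name, and the sample labels are assumptions for illustration.

```python
def type_ii_error_rate(actual, predicted):
    """False negatives divided by all actual positives."""
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    true_positives = sum(1 for a, p in zip(actual, predicted) if a and p)
    positives = false_negatives + true_positives
    if positives == 0:
        raise ValueError("no actual positives; the rate is undefined")
    return false_negatives / positives


# Hypothetical labels: 1 = tagged case, 0 = untagged case.
actual    = [1, 1, 1, 1, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 1, 0, 1, 0]
print(type_ii_error_rate(actual, predicted))  # 3 of the 5 actual positives are missed
```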

Microsoft, through Jamie's group, is providing us with great technical support, and I want to congratulate them on their efforts.