Saturday, September 01, 2012

Data Mining Components

We identify four components, or layers, in a data mining engagement, as shown in the figure below. Two abstract layers, business problems and data mining algorithms, are at the top. Two physical layers, data mining tools and data management, are at the bottom.

The business problem that we want to solve is one of the most important layers. It could be predicting fraud (bank card, check, or medical claims), the lifetime value of a new customer at the point of sale, online ad click rates, or creditworthiness, or it could be segmenting customers. We can address these business problems using various models and algorithms, such as logistic regression, neural nets, support vector machines, and K-means clustering.

The data mining tools layer contains commercial or open source software such as SAS, S-PLUS, R, Weka, SPSS, StatSoft, and Oracle Data Mining. Common data mining algorithms can be found in almost all of the software mentioned above. The data management/storage layer consists of relational databases such as Oracle, SQL Server, and MySQL, or simply files such as SAS files or text files.

We can predict whether a new customer will repay a car loan using a logistic regression model implemented in SAS, with the data stored in Oracle. Or we can solve the same problem using a decision tree model implemented in R, with the data stored in SQL Server. It is important to realize that items within each layer are sometimes interchangeable. We can solve a given business problem with a variety of data mining algorithms, implemented in commercial or open source tools, with the data stored in any database. Thus it is a misconception that neural nets are the best at predicting credit card fraud. An experienced data miner can build a decision tree model to predict credit card fraud that performs as well as a neural net does. We can select the combination of data mining models, tools, and databases that suits our needs.
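To make the interchangeability of layers concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the `loans` table, its columns, and the ten rows are synthetic, an in-memory SQLite database stands in for Oracle or SQL Server, and a one-level decision tree (a "decision stump") stands in for a full decision tree model. The point is only that the same business problem and the same data store can be paired with either algorithm through a common `fit`/`predict` interface.

```python
import math
import sqlite3

# Data management layer: in-memory SQLite as a stand-in for any relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (debt_ratio REAL, late_payments REAL, paid INTEGER)")
# Synthetic, made-up rows: (debt_ratio, late_payments, paid)
conn.executemany("INSERT INTO loans VALUES (?, ?, ?)", [
    (0.10, 0, 1), (0.15, 1, 1), (0.20, 0, 1), (0.25, 1, 1), (0.30, 0, 1),
    (0.45, 2, 0), (0.50, 3, 0), (0.55, 2, 0), (0.60, 4, 0), (0.70, 3, 0),
])

def load_data(connection):
    """Read features and the target from the storage layer."""
    data = connection.execute(
        "SELECT debt_ratio, late_payments, paid FROM loans").fetchall()
    return [[r[0], r[1]] for r in data], [r[2] for r in data]

# Algorithm layer, option 1: logistic regression fit by batch gradient descent.
class LogisticRegression:
    def __init__(self, lr=0.1, epochs=5000):
        self.lr, self.epochs = lr, epochs

    def _prob(self, x):
        z = self.b + sum(w * v for w, v in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        n, k = len(X), len(X[0])
        self.w, self.b = [0.0] * k, 0.0
        for _ in range(self.epochs):
            grad_w, grad_b = [0.0] * k, 0.0
            for xi, yi in zip(X, y):
                err = self._prob(xi) - yi
                grad_b += err
                for j in range(k):
                    grad_w[j] += err * xi[j]
            self.w = [w - self.lr * g / n for w, g in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b / n
        return self

    def predict(self, X):
        return [1 if self._prob(x) >= 0.5 else 0 for x in X]

# Algorithm layer, option 2: a decision stump (a decision tree with one split).
class DecisionStump:
    def fit(self, X, y):
        best = None
        for j in range(len(X[0])):
            for t in sorted({x[j] for x in X}):
                for below in (0, 1):  # label assigned when feature <= threshold
                    preds = [below if x[j] <= t else 1 - below for x in X]
                    errors = sum(p != yi for p, yi in zip(preds, y))
                    if best is None or errors < best[0]:
                        best = (errors, j, t, below)
        _, self.feature, self.threshold, self.below = best
        return self

    def predict(self, X):
        return [self.below if x[self.feature] <= self.threshold
                else 1 - self.below for x in X]

def accuracy(model, X, y):
    preds = model.predict(X)
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)

# Same business problem, same data store, two interchangeable algorithms.
X, y = load_data(conn)
results = {}
for name, model in [("logistic regression", LogisticRegression()),
                    ("decision stump", DecisionStump())]:
    results[name] = accuracy(model.fit(X, y), X, y)
    print(f"{name}: training accuracy = {results[name]:.2f}")
```

The accuracy printed here is measured on the same toy rows used for fitting, so it says nothing about how either model would perform on real loan data; a real engagement would hold out data for evaluation. Swapping the storage layer would only require changing `load_data`, and swapping the algorithm only requires another class with the same `fit` and `predict` methods.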