Data mining

Professor Stephan Kudyba describes what data mining is and how it is being used in the business world.

What is data mining?
Data mining is the search for new, valuable, and nontrivial information in large volumes of data. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an “interesting” outcome. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.
Data mining is a process consisting of many elements: formulating business goals; mapping business goals to data mining goals; acquiring, understanding, and pre-processing the data; evaluating and presenting the results of the analysis; and deploying these results to achieve business benefits.
Data mining applies advanced data analysis algorithms from various disciplines, such as statistics, AI (machine learning), pattern recognition, neurocomputing, and databases. Combining these approaches makes it possible to solve problems that no single method could solve on its own.

And what is data mining for us?
For us, data mining is a process that integrates knowledge from many different domains of science and whose aim is to explore the knowledge hidden in data. Our philosophy is based on a creative combination of classical, commonly applied analytic methods of data mining with computational and interpretative methods offered by other scientific domains. This combination is flexible and is configured according to need.

Data mining in relation to SQL and OLAP
The usual way of acquiring information from databases is to build queries, for instance in SQL. With this approach, information about all customers who purchased both product A and product B can be obtained. However, in order to write this query, we have to assume or already be aware of such a relationship between the products. Data mining techniques, on the other hand, allow us to discover relationships that we don’t have to define precisely. Only the general form of the relationship has to be specified, without selecting the particular products.
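The contrast can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the purchase records, the table layout, and the function names `customers_with_a_and_b` and `frequent_pairs` are all hypothetical. The first function runs a fixed SQL query in which products A and B must be named in advance. The second only states the general form of the relationship ("pairs of products bought by the same customer") and lets the counts reveal which pairs are frequent.

```python
import sqlite3
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical purchase records: (customer_id, product).
PURCHASES = [
    (1, "A"), (1, "B"), (1, "C"),
    (2, "A"), (2, "B"),
    (3, "B"), (3, "C"),
    (4, "A"), (4, "B"), (4, "D"),
    (5, "C"), (5, "D"),
]


def customers_with_a_and_b(purchases):
    """Query approach: the analyst must name products A and B up front."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, product TEXT)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", purchases)
    rows = conn.execute(
        """
        SELECT customer_id
        FROM purchases
        WHERE product IN ('A', 'B')
        GROUP BY customer_id
        HAVING COUNT(DISTINCT product) = 2
        ORDER BY customer_id
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


def frequent_pairs(purchases, min_support):
    """Mining approach: count every product pair and keep the frequent ones."""
    # Group products into one basket per customer.
    baskets = defaultdict(set)
    for customer_id, product in purchases:
        baskets[customer_id].add(product)
    # Count how many customers bought each pair of products.
    counts = Counter()
    for products in baskets.values():
        for pair in combinations(sorted(products), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


print(customers_with_a_and_b(PURCHASES))
print(frequent_pairs(PURCHASES, min_support=2))
```

Counting all pairs is the simplest possible form of association discovery and is chosen here only for brevity; it does not scale to large product catalogues the way dedicated algorithms do.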
To extend our example, the list of customers might be used as a target group for a marketing campaign. Such a campaign would be more effective than a campaign directed at randomly chosen customers. The campaign’s effectiveness might be measured as growth in sales. Growth in sales cannot be predicted with SQL queries alone. Data mining techniques, unlike SQL queries, allow us to build models that predict sales growth on the basis of historical data.
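As a minimal sketch of what "building a model on historical data" can mean, the following fits a least-squares line relating campaign budget to sales growth and then uses it for a prediction. The figures and the names `fit_line` and `predict` are hypothetical, and a straight line is only one of many possible models; a real project would choose the model to suit the data.

```python
# Hypothetical history: campaign budget (thousands) and observed sales growth (%).
BUDGETS = [10, 20, 30, 40, 50]
GROWTH = [1.2, 1.9, 3.2, 3.8, 5.1]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return intercept + slope * x


model = fit_line(BUDGETS, GROWTH)
print(f"Predicted growth for a budget of 60: {predict(model, 60):.2f}%")
```

The point of the sketch is the division of labour: the query language retrieves the historical records, while the fitted model extrapolates from them to a case that has not yet been observed.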
OLAP (online analytical processing) is a tool that presents relationships between variables in the form of multidimensional graphical or tabular reports. The OLAP user forms a hypothesis and then searches for the answer in the reports. To find out why some loans weren’t paid off, the user might create reports that include information about incomes and the number of loans. The OLAP report would then show whether these variables are responsible for the problems with loan repayment. As with SQL queries, the user must presume that there is a relationship between the variables before performing the analysis.
Furthermore, OLAP analysis is not well suited to problems described by a large number of variables. As the number of variables increases, the process of making assumptions about relationships between them becomes more difficult. With data mining, assumptions about relationships between variables are not needed, and the number of analyzed variables is not limited in this way.
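The loan example can be sketched as a small cross-tabulation. This is an illustration under stated assumptions, not how any particular OLAP product works: the loan records and the function name `default_rate_report` are hypothetical. Note that the two dimensions, income band and number of loans, are chosen by the analyst before the report is built, which is exactly the presumption described above.

```python
from collections import defaultdict

# Hypothetical loan records: (income_band, number_of_loans, repaid).
LOANS = [
    ("low", 1, True), ("low", 2, False), ("low", 2, False),
    ("high", 1, True), ("high", 1, True), ("high", 2, True),
    ("low", 1, False), ("high", 2, False),
]


def default_rate_report(loans):
    """Share of unpaid loans for each (income_band, number_of_loans) cell."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for income_band, n_loans, repaid in loans:
        cell = (income_band, n_loans)
        totals[cell] += 1
        if not repaid:
            defaults[cell] += 1
    return {cell: defaults[cell] / totals[cell] for cell in totals}


for cell, rate in sorted(default_rate_report(LOANS).items()):
    print(cell, f"{rate:.0%}")
```

With two dimensions the report has four cells. Each additional variable multiplies the number of cells the user has to inspect, which is one way to see why this style of analysis becomes harder as variables are added.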

Decision support system
The accuracy of a decision depends on the quality of the information used as the basis for the decision-making process. In some cases, historical data can be of great value for gathering the appropriate information. This data might include facts, numbers, plots, figures, or sounds. The process of selecting the appropriate information is called modeling.
Models are created in order to explain the possible results of a decision. A model is equivalent to a process of extracting valuable information from data that explains the decision outcome. Thus the quality of a decision depends on the quality of the available information and the accuracy of the model.
The role of a system responsible for gathering data and for processing and analyzing the information in order to improve the decision-making process can be fulfilled by a DSS (decision support system). A DSS is a computer-based system that collects data from various sources, applies the appropriate model, provides an easy-to-use interface, presents the results of the analysis, and allows for the decision maker’s own insights.
By improving the process of data gathering and analysis, a DSS increases the quality of decisions in terms of how closely the actual results match the expected results. The quality of an organization’s management improves as the quality of its decisions increases. A DSS is most effective in scenarios where it is not obvious what kind of information should be delivered and which models should be applied.
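The components listed above can be sketched as a small class. This is a toy outline under stated assumptions, not a description of any real DSS product: the class `DecisionSupportSystem`, the two data sources, and the `average_sales` model are all hypothetical. It collects records from several sources, applies a pluggable model, and presents the results together with an optional note from the decision maker.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, float]


@dataclass
class DecisionSupportSystem:
    sources: List[Callable[[], Iterable[Record]]]      # where the data comes from
    model: Callable[[List[Record]], Dict[str, float]]  # how the data is analyzed

    def collect(self) -> List[Record]:
        """Gather records from every configured source."""
        records: List[Record] = []
        for source in self.sources:
            records.extend(source())
        return records

    def run(self, analyst_note: str = "") -> str:
        """Apply the model and present results plus the user's own insight."""
        results = self.model(self.collect())
        lines = [f"{name}: {value:.2f}" for name, value in sorted(results.items())]
        if analyst_note:
            lines.append(f"Analyst note: {analyst_note}")
        return "\n".join(lines)


# Hypothetical data sources and model.
def branch_sales() -> List[Record]:
    return [{"sales": 120.0}, {"sales": 80.0}]


def online_sales() -> List[Record]:
    return [{"sales": 100.0}]


def average_sales(records: List[Record]) -> Dict[str, float]:
    return {"average_sales": sum(r["sales"] for r in records) / len(records)}


dss = DecisionSupportSystem([branch_sales, online_sales], average_sales)
print(dss.run(analyst_note="check seasonality before acting"))
```

Keeping the sources and the model as interchangeable parts reflects the idea that a DSS is most useful when it is not known in advance which information and which model will be needed.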

Application of data mining in DSS
Data mining is one of many techniques of data analysis. Its aim is similar to that of a DSS: to deliver analyses that facilitate the decision-making process. The results of a data mining analysis, and the decision based on that analysis, are not final. An expert’s knowledge must be used to assess the possible results of the decision. Thus data mining supports decisions but does not make them.
Data mining uses data that might contain the information necessary to solve the problem. The data used as input to the data mining process must be selected and preprocessed. This phase cannot be automated, and the preprocessing must be performed by a human analyst. By carefully examining the gathered data and the problem to be solved, the analyst can choose the appropriate model that will deliver the best results.
Data mining fits into the DSS area and is applied to a wide range of problems.