This blog discusses the work I have carried out on various subjects that interest me.

Sunday, June 8, 2008

Data Mining Using RapidMiner 4.1

Data mining involves searching through databases for correlations and patterns that differ from what would be expected to occur by chance. The practice of data mining is in itself neither good nor bad, and it has become common in many industries. I participated in a data mining assignment that analyzed the export patterns of gems and jewelry in Sri Lanka. The tool used was RapidMiner 4.1, and several important facts came to light in the process.

The process of data mining consists of three stages: (1) initial exploration, (2) model building or pattern identification with validation, and (3) deployment.

Stage 1: Exploration

This stage usually starts with data preparation, which may involve cleaning the data, transforming it, selecting subsets of records and, for data sets with a large number of variables ("fields"), performing some preliminary feature selection to bring the number of variables down to a manageable range.
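To make these preparation steps concrete, here is a minimal sketch in plain Python rather than RapidMiner. The records, field names (`year`, `gem`, `value_usd`, `unit`) and helper functions are all made up for illustration and are not taken from the actual assignment data.

```python
# Hypothetical export records; every field name and value here is invented.
RAW_RECORDS = [
    {"year": "2005", "gem": "Sapphire ", "value_usd": "1200", "unit": "ct"},
    {"year": "2006", "gem": "sapphire", "value_usd": "", "unit": "ct"},
    {"year": "2006", "gem": "Ruby", "value_usd": "800", "unit": "ct"},
    {"year": "2007", "gem": "ruby", "value_usd": "950", "unit": "ct"},
]


def clean(records):
    """Cleaning and transformation: drop rows with a missing value, fix types and text."""
    cleaned = []
    for row in records:
        if not row["value_usd"].strip():
            continue  # skip rows where the export value is missing
        cleaned.append({
            "year": int(row["year"]),
            "gem": row["gem"].strip().lower(),
            "value_usd": float(row["value_usd"]),
            "unit": row["unit"],
        })
    return cleaned


def select_subset(records, min_year):
    """Subset selection: keep only records from min_year onwards."""
    return [r for r in records if r["year"] >= min_year]


def drop_constant_fields(records):
    """Very basic feature selection: remove fields whose value never changes."""
    if not records:
        return []
    keep = [k for k in records[0] if len({r[k] for r in records}) > 1]
    return [{k: r[k] for k in keep} for r in records]


if __name__ == "__main__":
    cleaned = clean(RAW_RECORDS)
    print(select_subset(cleaned, 2006))
    print(drop_constant_fields(cleaned))
```

Dropping constant fields is only the simplest possible form of feature selection; a real project would normally use a stronger criterion, such as correlation with the target variable.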

The figure shows a screenshot of a graph of the gem exports, produced with RapidMiner.

Stage 2: Model building and validation

The dimensional model must suit the requirements of the users and support ease of use for direct access. The model must also be designed so that it is easy to maintain and can adapt to future changes.

The figure shows a 2D Model View of the data.

The model design must result in a relational database that supports OLAP cubes, so that analysts get near-instant query results. A typical dimensional model uses a star or snowflake design, which is easy to understand and relate to business needs, supports simplified business queries, and provides good query performance by minimizing table joins.
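As an illustration of the star design, here is a small sketch using Python's built-in `sqlite3` module. The table names (`fact_export`, `dim_product`, `dim_time`), columns and figures are hypothetical and are not the schema used in the assignment. It shows one fact table surrounded by two dimension tables, and a typical business query that needs only one join per dimension.

```python
import sqlite3


def build_star_schema():
    """Create an in-memory star schema with invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);
        -- The fact table holds the measures plus one key per dimension.
        CREATE TABLE fact_export (
            product_id INTEGER REFERENCES dim_product(product_id),
            time_id INTEGER REFERENCES dim_time(time_id),
            value_usd REAL
        );
        INSERT INTO dim_product VALUES (1, 'sapphire');
        INSERT INTO dim_product VALUES (2, 'ruby');
        INSERT INTO dim_time VALUES (1, 2006);
        INSERT INTO dim_time VALUES (2, 2007);
        INSERT INTO fact_export VALUES (1, 1, 100.0);
        INSERT INTO fact_export VALUES (1, 2, 150.0);
        INSERT INTO fact_export VALUES (2, 1, 80.0);
        INSERT INTO fact_export VALUES (2, 2, 70.0);
    """)
    return conn


def total_by_product(conn, year):
    """Total export value per product for one year."""
    return conn.execute(
        """
        SELECT p.name, SUM(f.value_usd)
        FROM fact_export AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_time AS t ON t.time_id = f.time_id
        WHERE t.year = ?
        GROUP BY p.name
        ORDER BY p.name
        """,
        (year,),
    ).fetchall()


if __name__ == "__main__":
    connection = build_star_schema()
    print(total_by_product(connection, 2007))
    connection.close()
```

In a snowflake design the dimension tables would themselves be normalized into further tables (for example, a product category table), which costs extra joins in exchange for less redundancy.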

Stage 3: Deployment

The final stage involves taking the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
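A minimal sketch of what deployment means in practice, again in plain Python: a model is fitted once on historical data and then applied to a new input. The straight-line trend model and the yearly figures below are assumptions chosen for illustration only. They are not the model or the results from the gem export assignment.

```python
def fit_linear_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Invented historical data: year and export value in arbitrary units.
    years = [2004, 2005, 2006, 2007]
    values = [10.0, 12.0, 14.0, 16.0]

    model = fit_linear_trend(years, values)  # built during stage 2
    print(predict(model, 2008))              # deployment: score unseen data
```

The important point is the separation: the fitted model is kept as it is, and only `predict` runs when new data arrives.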