Tuesday, September 1, 2009

Data Preprocessing – Normalization - Real time example

In this article, we are going to see normalization in action in a popular web application. Readers who are not familiar with normalization should refer to my previous post.

We all know how well Google exploits the available technology to give us innovative products. Google Insights for Search is one such product, and its concept is based almost entirely on normalization. Let us see what this application allows us to do. Suppose I want to find out who was the more popular tennis player in 2009: Serena Williams or Venus Williams? Insights allows me to answer this question based on the web traffic (news articles, searches) for these two keywords.
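To make the idea concrete, here is a minimal sketch of how raw counts for two keywords could be put on a shared 0 to 100 scale, so that the two players can be compared on one chart. This is only an illustration of normalization, not a description of how Google actually computes its figures. The function name `scale_to_peak` and the monthly counts are hypothetical and invented for this example.

```python
def scale_to_peak(series_by_term):
    """Rescale every series so the largest value across all terms becomes 100."""
    # Find the single highest count across all keywords and all periods.
    peak = max(v for series in series_by_term.values() for v in series)
    if peak == 0:
        # Nothing to scale: every count is zero.
        return {term: [0.0 for _ in series] for term, series in series_by_term.items()}
    # Express each count as a percentage of that shared peak.
    return {
        term: [100.0 * v / peak for v in series]
        for term, series in series_by_term.items()
    }


# Hypothetical monthly counts, invented for illustration only.
raw = {
    "serena williams": [200, 400, 800],
    "venus williams": [100, 200, 400],
}

scaled = scale_to_peak(raw)
for term, values in scaled.items():
    print(term, values)
```

Because both series are divided by the same peak, the relative difference between the two keywords is preserved while the absolute volumes are hidden.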

Wednesday, July 15, 2009

Further to the introduction, in this article I am going to discuss “Data Preprocessing”, an important step in the knowledge discovery process that can even be considered a fundamental building block of data mining. People who come from a data warehousing background may already be familiar with the term ETL (which stands for Extraction, Transformation and Loading). The success of any data mining or data warehousing effort depends on how well the ETL is performed. DP (I am going to refer to Data Preprocessing as DP henceforth) is a part of ETL; it is nothing but transforming the data. To be more precise, it means modifying the source data into a different format which (i) enables data mining algorithms to be applied easily, (ii) improves the effectiveness and the performance of the mining algorithms, (iii) represents the data in an easily understandable format for both humans and machines, (iv) supports faster data retrieval from databases, and (v) makes the data suitable for a specific analysis to be performed.
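As a small illustration of such a transformation, the sketch below rescales a numeric attribute into the range 0 to 1 using min-max scaling, one common preprocessing step. The helper name `min_max_scale` and the sample ages are assumptions made up for this example; they do not come from any particular tool or dataset.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map them all to the lower bound.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical attribute values, invented for illustration only.
ages = [20, 30, 40, 60]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

After scaling, attributes measured in very different units share a comparable range, which is one way the transformed data becomes easier for a mining algorithm to work with.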

Wednesday, June 10, 2009

Today, data is abundant all around us. With drastic improvements in performance and reduced hardware costs, computers have become ubiquitous. The number of people using the internet is growing at a very rapid rate. As a consequence of all this, data is accumulating at an unimaginable rate. This massive volume of data makes the traditional methods of analysis almost worthless. Traditional methods usually involve analysing the data record by record, which takes an impractical amount of time at this scale, even with today's modern computers. So what is the use of storing this astronomical amount of data if we cannot find any useful information in it?