
Wednesday, November 7, 2012

Imagine you have 100 MB of data stored in a structured form (an RDBMS) and you need to process it. The best place to do that is your personal computer, because a PC has no trouble processing that much data. A PC can even handle up to a few GB of data.

But what happens when:
1. Data grows exponentially and you approach the limits of your computer.
2. Data arrives in unstructured form.
3. Data becomes a burden to your IT department.

Management wants to derive information from both relational and unstructured data. The answer is Hadoop. Hadoop is an open source project of the Apache Software Foundation, written in Java and originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is designed for distributed deployment, spreading storage and batch processing across many machines that work in parallel. It is optimized to handle massive quantities of data, which can be structured (as in an RDBMS), unstructured (tweets, Facebook comments, etc.) or semi-structured, using commodity hardware, that is, relatively inexpensive computers. Hadoop replicates its data across different computers, so that if one goes down, the data can be processed on one of the computers holding a replica.
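To make the MapReduce idea more concrete, here is a minimal in-memory sketch in Python of the three steps: map, shuffle and reduce. It counts words in a few lines of unstructured text. This is only an illustration of the programming model, not Hadoop's actual Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample comments are made up for this example, and a real Hadoop job would run the map and reduce steps on many machines instead of one process.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence on a list of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input standing in for unstructured data such as comments.
comments = ["big data is big", "hadoop handles big data"]
counts = word_count(comments)
print(counts)
```

Because each call to `map_phase` looks at only one record, and each call to `reduce_phase` looks at only one key, both steps can be split across many computers. That independence is what lets this model scale out on commodity hardware.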

Hadoop is not suited to Online Transaction Processing (OLTP) workloads, where structured data, such as that in a relational database, is accessed randomly. Nor is it suited to Online Analytical Processing (OLAP) or Decision Support System workloads, where structured data is accessed sequentially to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OLTP and OLAP systems; it is not a replacement for a relational database system.

So, what is Big Data?
With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, and so on, we are seeing an explosion in the amount of data being collected worldwide. Big Data is a term used to describe large collections of data (also known as datasets) that may be unstructured, and that grow so large and so quickly that they are difficult to manage with regular database or statistics tools. As a result, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant for every type of industry.