Apache Hadoop and Hive for Big Data Storage and Processing

Hadoop alone is a productive framework for distributed big data applications, but combined with Hive it overcomes big data challenges even better.

by Kamal Bhagubhai Mistry

Ira Agrawal

May 15, 2012


Big data is an aptly named concept. Huge amounts of data are generated every day by hundreds of millions of applications, devices and people all around the world. Users' personal data maintained by social networking sites, blogs hosted on public sites, weather data generated by different types of sensors, and customer and product information maintained by large enterprises -- all of it contributes to big data.

The number of data sets and the volume of information being generated, processed and analyzed, particularly for business intelligence and decision making, are growing rapidly as well. This growth makes traditional warehousing solutions expensive, difficult to leverage and mostly ineffective. There is clearly a need for a general, flexible and scalable tool that can cope with the challenges of big data.

One such challenge is processing the whole set of data in a way that is easy, efficient and fast.

Enter Hadoop for Big Data

Apache Hadoop is a popular software framework that supports data-intensive distributed applications. It is a map-reduce implementation inspired by Google's MapReduce and Google File System (GFS) papers. Hadoop is written in Java, and many large organizations (e.g. Yahoo and Facebook) use it to run large distributed computations on commodity hardware. It is designed to scale from a single server node to thousands of machines, each offering local computation and storage. These capabilities make Hadoop a strong candidate for developers dealing with big data.

However, the map-reduce programming model used by Hadoop is very low level, and Hadoop lacks the expressiveness of SQL-like query languages, so developers spend a lot of time writing programs for even simple analyses. Hadoop is also hard to pick up for developers who are not familiar with the map-reduce concept. They must write a custom program for each operation, even for simple tasks such as counting the rows in a log or averaging a column. The resulting code is also hard to maintain and to reuse across applications.
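To make the point about custom programs concrete, here is a minimal sketch of the map-reduce pattern in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, run_job), the log format and the sample data are all hypothetical and exist only to show how much scaffolding a simple "row count and average per key" needs in this style.

```python
from collections import defaultdict

# Hypothetical log records in the form "<url> <response time in ms>".
# The data is invented purely for illustration.
LOG_LINES = [
    "/home 120",
    "/cart 300",
    "/home 80",
    "/cart 100",
    "/home 100",
]


def map_phase(line):
    """Map step: turn one input line into (key, value) pairs."""
    url, millis = line.split()
    yield url, int(millis)


def shuffle(pairs):
    """Group values by key, as a map-reduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into (count, average)."""
    return key, len(values), sum(values) / len(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return sorted(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    for url, count, average in run_job(LOG_LINES):
        print(url, count, average)
```

Even in this toy form, counting rows and averaging one column takes three functions plus a driver. In an SQL-like language the same analysis would typically be a single statement with COUNT, AVG and GROUP BY.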

Apache Hive, a data warehouse system that runs on top of Hadoop, overcomes these limitations by providing a simple, SQL-like query interface. Hive efficiently meets the challenges of storing, managing and processing big data, which are difficult to handle with traditional RDBMS solutions.
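The simple interface in question is a declarative, SQL-like query language. The sketch below illustrates that style only: it uses Python's built-in sqlite3 module as a stand-in, not Hive itself, and the table name, column names and sample rows are hypothetical. A query of this shape (COUNT and AVG with GROUP BY) is the kind of statement one would expect to write in Hive's query language instead of a hand-coded job.

```python
import sqlite3

# Hypothetical sample rows: (url, response time in ms). Invented for illustration.
ROWS = [
    ("/home", 120),
    ("/cart", 300),
    ("/home", 80),
    ("/cart", 100),
    ("/home", 100),
]


def summarize(rows):
    """Return (url, row count, average ms) per URL using one SQL statement.

    sqlite3 is only a stand-in here to show the declarative style;
    it is not Hive and does not run on Hadoop.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE logs (url TEXT, millis INTEGER)")
        conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
        return conn.execute(
            "SELECT url, COUNT(*), AVG(millis) "
            "FROM logs GROUP BY url ORDER BY url"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for url, count, average in summarize(ROWS):
        print(url, count, average)
```

The whole analysis lives in the SELECT statement; the rest is setup for the in-memory stand-in database.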