Difference between Hadoop and RDBMS

Hadoop vs RDBMS:

RDBMS and Hadoop take different approaches to storing, processing, and retrieving information. DBMS and RDBMS have been in the literature for a long time, whereas Hadoop is a comparatively new concept. As storage capacities and the volume of customer data have grown enormously, processing this information within a reasonable amount of time has become crucial. This is especially true for data warehousing, business intelligence reporting, and other analytical processing: as data sizes grow rapidly and customers demand more complex analysis, producing complex reports within a reasonable amount of time becomes very challenging.

What is Hadoop?

Hadoop is an open source Apache project, and the framework is written in Java. It is scalable and can therefore support applications with demanding performance requirements. Hadoop makes it possible to store very large amounts of data across the file systems of multiple computers. It is designed to scale from a single node to thousands of independent nodes, with each node using its own local storage, CPU, and memory. Node failures are handled at the application layer, so nodes (that is, processing power) can be added dynamically as needed while the system stays highly available, for example without downtime in a production environment.

The Hadoop framework was developed based on Google's MapReduce programming model. In an organization, the term big data refers to a volume of information too large to be processed by traditional methods within a reasonable amount of time. The problem was first identified by Internet search companies that had to query very large amounts of unorganized, distributed data. Big data processing is in high demand these days, and Hadoop has therefore become very popular, especially among companies that process big data. Facebook, AOL, IBM, ImageShack, and Yahoo are some of the companies that have used Hadoop. More recently, hundreds of companies have started building big data processing applications on the Hadoop framework.
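To make the MapReduce idea concrete, here is a minimal sketch in plain Python. It is not Hadoop's actual API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and everything runs in one process, whereas on a real cluster each node would run the map step over its own local data.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = []
    for document in documents:
        # On a cluster, this loop body would run in parallel on many nodes.
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Because the map step looks at one document at a time and the reduce step looks at one key at a time, both can be spread over many machines, which is what lets this style of processing scale by adding nodes.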

What is RDBMS?

RDBMS stands for relational database management system. A database management system (DBMS) stores data in tables, which consist of columns and rows. Structured query language (SQL) is used to extract the necessary data from these tables. An RDBMS also stores the relationships between tables, for example by letting the entries in one column of a table serve as a reference to another table. These columns are known as primary keys and foreign keys. The keys are used to reference other tables so that related data can be retrieved by joining the tables with SQL queries as needed.
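The following sketch shows primary keys, a foreign key, and a join, using Python's built-in sqlite3 module with an in-memory database. The customers and orders tables and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Each table has a primary key; orders.customer_id is a foreign key
# that references the primary key of customers.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE orders (
           id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(id),
           amount REAL NOT NULL)"""
)

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

# Join the two tables through the key relationship and total each customer's orders.
rows = conn.execute(
    """SELECT c.name, SUM(o.amount)
       FROM customers AS c
       JOIN orders AS o ON o.customer_id = c.id
       GROUP BY c.name
       ORDER BY c.name"""
).fetchall()
print(rows)
```

The customer name is stored once in customers and only referenced from orders, which is the kind of normalized layout the relational model encourages.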

The most important attribute of a relational database system is that a single database generally has several tables and relationships between them, so that information is classified into tables of independent entities. The tables are stored independently in a normalized form, and the relationships between them are maintained using primary and foreign key constraints. This is different from a flat file or a simple data structure. To give quick and easy navigation to related data, logical views can be created from the actual tables. Every table exists physically in the database, whereas a view is a virtual table: it does not exist physically but is a logical creation derived from existing physical tables.

The data in a database can be stored in a single data file or in multiple data files. As new records are added and the database grows, the data files grow or new data files are added. All of these files are shared by the database server. In high availability systems, the data files are shared so that each node has access to the same data files. Most popular database systems are relational database management systems. IBM DB2, Microsoft SQL Server, Sybase, Oracle, MySQL, and PostgreSQL are some examples of RDBMS.
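As an illustration of the difference between a physical table and a view, here is a small sketch using Python's built-in sqlite3 module. The employees table, the sales_staff view, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A physical table: its rows are actually stored by the database.
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "sales", 52000.0),
        (2, "Bo", "engineering", 70000.0),
        (3, "Cid", "sales", 48000.0),
    ],
)

# A view: only the query is stored, not a second copy of the data.
conn.execute(
    "CREATE VIEW sales_staff AS SELECT id, name FROM employees WHERE dept = 'sales'"
)
view_rows = conn.execute("SELECT name FROM sales_staff ORDER BY name").fetchall()

# New rows in the base table show up in the view without any refresh step.
conn.execute("INSERT INTO employees VALUES (4, 'Dan', 'sales', 50000.0)")
view_rows_after = conn.execute("SELECT name FROM sales_staff ORDER BY name").fetchall()

print(view_rows)
print(view_rows_after)
```

The view hides the salary column and the non-sales rows, which is one common reason to offer a view instead of direct access to the table.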

What is the difference between Hadoop and an RDBMS?

The Hadoop framework works well with both structured and unstructured data. It also supports a variety of data formats, such as XML, JSON, and text-based flat files. An RDBMS, by contrast, works best when the entity-relationship (ER) model is well defined; otherwise the database schema can grow in an unmanaged way. In other words, an RDBMS works well with structured data. Hadoop is a good choice when there is a need for big data processing and the data being processed does not have consistent relationships. When the data is too large for complex processing, or the relationships within it are hard to define, it becomes difficult to save the extracted information in an RDBMS with a coherent set of relationships.
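The sketch below illustrates why loosely structured data is awkward to fit into fixed tables. It reads JSON records whose fields differ from record to record and decides what to do with them only at read time. It is plain Python rather than Hadoop code, and the sample records and the field_counts helper are made up for illustration.

```python
import json
from collections import Counter

# Hypothetical crawl output: one record per line, no agreed schema.
raw_lines = [
    '{"url": "a.example", "title": "Home", "tags": ["news"]}',
    '{"url": "b.example", "author": "kim"}',
    "not valid json",
    '{"url": "c.example", "title": "Blog", "comments": 3}',
]


def field_counts(lines):
    """Count how often each field name appears, skipping unreadable lines."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed input instead of rejecting the batch
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


fields = field_counts(raw_lines)
print(fields)
```

In a relational schema, every one of these optional fields would need a column or a separate table defined up front, whereas here the structure is discovered while the data is being processed.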

For example, consider analyzing data published on the Internet by various websites. Among the hundreds of millions of existing websites, each has different types of content, and the relationships between them are not uniform. In such cases, Hadoop is a great choice. As awareness of these capabilities grows, companies are choosing Hadoop not only to handle big data that has accumulated over time, but also to meet the high performance needs of new applications. One example is plotting a customer's monthly energy usage and comparing it with previous months, with his or her neighbors, or even with other customers on the same street. This brings more awareness to the customer, but running such a complex comparison over a large data set can take several hours of processing time, and introducing Hadoop can improve computing performance by a factor of 10 to 100 or more.
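The energy usage comparison can be sketched on a tiny scale as follows. This is plain Python over a handful of made-up readings, not a Hadoop job; the names street_average and compare are hypothetical, and it assumes one reading per customer per month. At real scale, the grouping by street and month is the part that would be distributed across nodes.

```python
from collections import defaultdict

# Hypothetical meter readings: (customer, street, month, kWh).
readings = [
    ("c1", "Elm St", "2024-01", 300.0),
    ("c1", "Elm St", "2024-02", 360.0),
    ("c2", "Elm St", "2024-02", 240.0),
    ("c3", "Oak Ave", "2024-02", 500.0),
]


def street_average(readings, month):
    """Average usage per street for one month."""
    totals = defaultdict(lambda: [0.0, 0])  # street -> [sum of kWh, count]
    for _, street, reading_month, kwh in readings:
        if reading_month == month:
            totals[street][0] += kwh
            totals[street][1] += 1
    return {street: total / count for street, (total, count) in totals.items()}


def compare(readings, customer, month, previous_month):
    """Compare one customer's usage with the previous month and the street average."""
    usage = {(c, m): (kwh, street) for c, street, m, kwh in readings}
    current, street = usage[(customer, month)]
    previous, _ = usage[(customer, previous_month)]
    average = street_average(readings, month)[street]
    return {
        "vs_previous_month": current - previous,
        "vs_street_average": current - average,
    }


result = compare(readings, "c1", "2024-02", "2024-01")
print(result)
```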

RDBMS technology is proven, consistent, mature, and well supported by the world's leading companies. It works best when the data has clear definitions, such as data types, relationships among the data, and constraints. Hence, it is more appropriate for real-time OLTP processing.

An RDBMS is generally used for OLTP processing, whereas Hadoop is currently used for analytical processing, especially big data processing.

In an RDBMS, maintenance on storage or data files typically requires downtime. In standalone database systems running in a non-virtualized environment, adding processing power such as more CPU or physical memory also requires downtime for RDBMS products such as DB2, Oracle, and SQL Server. Hadoop systems, however, consist of independent nodes that can be added as needed.

In RDBMS systems, a database cluster uses the same data files kept in shared storage, whereas in Hadoop the data can be stored independently on each processing node.

Performance tuning of an RDBMS can become a nightmare, even in a proven environment. Hadoop, however, enables hot tuning by adding extra nodes, which are self-managed.

This post also helps answer the following questions: What is the difference between a Hadoop database and a traditional relational database? What is the difference between a Hadoop database and a database management system (DBMS)?