Apache Hive - Internal and External Tables

Introduction:

We live in what is often called the Data Era, and for good reason: data is generated in every corner of the world. It comes from social media platforms, from the applications that organizations run, and from the log files those applications generate day in and day out. Processing data at this scale calls for a framework that is both time-efficient and cost-effective.

This gave rise to the term Big Data, which refers to collections of very large datasets characterized by high volume, high velocity and wide variety, all growing day by day. Analyzing data at this scale is impractical with traditional RDBMS systems, hence the need for frameworks like Apache Hadoop. Hadoop is an open-source framework that stores and processes Big Data in a distributed environment.

Hadoop MapReduce:

It is a parallel programming model for processing large amounts of structured, semi-structured and unstructured data.
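
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the task. It is plain Python rather than Hadoop code; the function names are chosen for illustration only, and a real job would run these phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hive runs on hadoop", "hadoop stores data"]))
```

Each map call looks at one line in isolation and each reduce call looks at one key in isolation, which is what allows a framework to spread the work over many machines.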

HDFS (Hadoop Distributed File System):

This is used to store datasets, and it provides a fault-tolerant file system that runs on commodity hardware.

Let us look in more detail at Apache Hive, a component of the Hadoop ecosystem, in the sections below.

What is Apache Hive?

As data grew in size, Java developers who could write complex MapReduce jobs for Hadoop became scarce. Hive, which is built on top of Hadoop, was created to fill this gap. Hive provides a SQL-like language called HiveQL for users to extract data from a Hadoop system. Hive transforms these SQL-like queries into Hadoop MapReduce jobs and runs them against a Hadoop cluster.

Apache Hive is well suited for data warehousing applications in which the data is structured, static and formatted. Because of its design constraints, Hive traditionally does not provide row-level updates and inserts (often cited as the biggest disadvantage of using Hive). Because most Hive queries are turned into MapReduce jobs, they have higher latency due to job start-up overhead.

What are Hive Internal and External Tables?

Internal or Managed Tables:

Tables created within Hive are very similar to tables created in any RDBMS. Each table is associated with a directory in HDFS, under the warehouse directory configured in ${HIVE_HOME}/conf/hive-site.xml.

By default, this path is /user/hive/warehouse in HDFS. For example, for a table named match, Hive creates the directory /user/hive/warehouse/match in HDFS. All the data for the table is stored in that directory, and hence such tables are called INTERNAL or MANAGED tables.

When data resides in an internal table, Hive takes full responsibility for the life cycle of both the data and the table itself. Consequently, the data is removed as soon as the internal table is dropped.
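
As a small illustration of this directory layout, the following Python sketch computes where a managed table's data would live. The helper name `managed_table_dir` and the database name `sports` are hypothetical. The sketch assumes the default warehouse path mentioned above and Hive's convention of placing tables of a non-default database under a `<database>.db` subdirectory.

```python
import posixpath

# Default warehouse directory, as mentioned in the text above.
DEFAULT_WAREHOUSE = "/user/hive/warehouse"


def managed_table_dir(table, database="default", warehouse=DEFAULT_WAREHOUSE):
    """Return the HDFS directory expected to hold a managed table's data."""
    if database == "default":
        # Tables in the default database sit directly under the warehouse.
        return posixpath.join(warehouse, table)
    # Tables in other databases sit under "<database>.db".
    return posixpath.join(warehouse, database + ".db", table)


if __name__ == "__main__":
    print(managed_table_dir("match"))
    print(managed_table_dir("match", database="sports"))
```

If the warehouse directory has been changed in hive-site.xml, pass the configured value as `warehouse` instead of relying on the default.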

External Tables:

If data already exists in the HDFS cluster, an external Hive table can be created to describe it. These tables are called external tables because their data resides in the path specified by the LOCATION property instead of the default warehouse directory described above.

When an external table is dropped, its metadata is deleted but the data is kept as is. In other words, Hive leaves the data in the path specified by the LOCATION property untouched. If you want to delete such data, you have to remove it from HDFS explicitly.
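
The difference between the two kinds of table shows up in the HiveQL statement that creates them. The sketch below builds both statements as Python strings so that it stays self-contained. The helper `create_table_ddl`, the table name `match_ext`, the columns and the path `/data/matches` are all assumptions made for illustration. A real statement would usually also specify the row format and storage format.

```python
def create_table_ddl(table, columns, location=None):
    """Build a simplified HiveQL CREATE TABLE statement.

    Without a location the table is declared as a managed table; with a
    location it is declared EXTERNAL and pointed at the existing data.
    This is a simplification: Hive itself allows other combinations.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    keyword = "CREATE EXTERNAL TABLE" if location else "CREATE TABLE"
    ddl = f"{keyword} {table} ({cols})"
    if location:
        ddl += f" LOCATION '{location}'"
    return ddl


if __name__ == "__main__":
    columns = [("id", "INT"), ("venue", "STRING")]  # hypothetical schema
    print(create_table_ddl("match", columns))
    print(create_table_ddl("match_ext", columns, location="/data/matches"))
```

The only differences between the two generated statements are the EXTERNAL keyword and the LOCATION clause, which tell Hive that the files belong to someone else.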
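
The drop behaviour of both table types, including the explicit removal of external data, can be modelled locally. The following Python sketch is a simulation only: it uses a temporary local directory in place of HDFS and a plain dictionary in place of the Hive metastore, and every name in it is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(metastore, name):
    """Model DROP TABLE: always remove the metadata entry, but remove
    the data files only when the table is managed (not external)."""
    entry = metastore.pop(name)
    if not entry["external"]:
        shutil.rmtree(entry["location"])


def demo():
    root = Path(tempfile.mkdtemp())  # stands in for HDFS
    managed = root / "warehouse" / "match"
    external = root / "data" / "matches"
    for directory in (managed, external):
        directory.mkdir(parents=True)
        (directory / "part-00000").write_text("1,Lords\n")

    # Stand-in for the metastore: table name -> location and table kind.
    metastore = {
        "match": {"location": managed, "external": False},
        "match_ext": {"location": external, "external": True},
    }
    drop_table(metastore, "match")
    drop_table(metastore, "match_ext")
    result = {
        "managed_data_exists": managed.exists(),
        "external_data_exists": external.exists(),
    }

    # External data survives the drop and must be removed explicitly.
    shutil.rmtree(external)
    result["external_after_manual_delete"] = external.exists()
    shutil.rmtree(root)  # clean up the temporary directory
    return result


if __name__ == "__main__":
    print(demo())
```

In the simulation, dropping the managed table deletes its directory, while the external table's directory remains until it is deleted in a separate step.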

Conclusion:

In this article, we introduced Apache Hadoop and then one of the powerful components of its ecosystem, Apache Hive. We also looked at how internal and external tables are used within Hive.

About The Author

Ravindra Savaram is a Content Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
