Bucketing in Hive – Creation of Bucketed Table in Hive

by DataFlair Team ·
Published March 14, 2018
· Updated November 15, 2018

1. Objective

In ApacheHive, for decomposing table data sets into more manageable parts, it uses Hive Bucketing concept. However, there are much more to learn about Bucketing in Hive. So, in this article, we will cover the whole concept of Bucketing in Hive. It includes one of the major questions, that why even we need Bucketing in Hive after Hive Partitioning Concept. At last, we will discuss Features of Bucketing in Hive, Advantages of Bucketing in Hive, Limitations of Bucketing in Hive, Example Use Case of Bucketing in Hive with some Hive Bucketing with examples.

3. Why Bucketing?

Basically, the concept of Hive Partitioning provides a way of segregating hive table data into multiple files/directories. However, it only gives effective results in few scenarios. Such as:– When there is the limited number of partitions.– Or, while partitions are of comparatively equal size.Although, it is not possible in all scenarios. For example when are partitioning our tables based geographic locations like country. Hence, some bigger countries will have large partitions (ex: 4-5 countries itself contributing 70-80% of total data). While small countries data will create small partitions (remaining all countries in the world may contribute to just 20-30 % of total data). Hence, at that time Partitioning will not be ideal.Then, to solve that problem of over partitioning, Hive offers Bucketing concept. It is another effective technique for decomposing table data sets into more manageable parts.Let’s see a difference between Hive Partitioning and Bucketing tutorial in detail

4. Features of Bucketing in Hive

Basically, this concept is based on hashing function on the bucketed column. Along with mod (by the total number of buckets). i. Where the hash_function depends on the type of the bucketing column.ii. However, the Records with the same bucketed column will always be stored in the same bucket.iii. Moreover, to divide the table into buckets we use CLUSTERED BY clause.iv. Generally, in the table directory, each bucket is just a file, and Bucket numbering is 1-based.v. Along with Partitioning on Hive tables bucketing can be done and even without partitioning.vi. Moreover, Bucketed tables will create almost equally distributed data file parts.

5. Advantages of Bucketing in Hive

i. On comparing with non-bucketed tables, Bucketed tables offer the efficient sampling.ii. Map-side joins will be faster on bucketed tables than non-bucketed tables, as the data files are equal sized parts.
iii. Here also bucketed tables offer faster query responses than non-bucketed tables as compared to Similar to partitioning.iv. This concept offers the flexibility to keep the records in each bucket to be sorted by one or more columns.v. Since the join of each bucket becomes an efficient merge-sort, this makes map-side joins even more efficient.Read about What is Hive Metastore – Different Ways to Configure Hive Metastore

6. Limitations of Bucketing in Hive

i. However, it doesn’t ensure that the table is properly populated.ii. So, we need to handle Data Loading into buckets by our-self.

7. Example Use Case for Bucketing in Hive

To understand the remaining features of Hive Bucketing let’s see an example Use case, by creating buckets for the sample user records file for testing in this post

a. Creation of Bucketed Tables

However, with the help of CLUSTERED BY clause and optional SORTED BY clause in CREATE TABLE statement we can create bucketed tables. Moreover, we can create a bucketed_user table with above-given requirement with the help of the below HiveQL.CREATE TABLE bucketed_user( firstname VARCHAR(64),lastname VARCHAR(64),address STRING,city VARCHAR(64), state VARCHAR(64),post STRING,phone1 VARCHAR(64),phone2 STRING,email STRING,web STRING) COMMENT ‘A bucketed sorted user table’PARTITIONED BY (country VARCHAR(64)) CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETSSTORED AS SEQUENCEFILE;Do you Know Feature Wise Difference between Hive vs HBase

As shown in above code for state and city columns Bucketed columns are included in the table definition, Unlike partitioned columns. Especially, which are not included in table columns definition.

b. Inserting data Into Bucketed Tables

However, we can not directly load bucketed tables with LOAD DATA (LOCAL) INPATH command, similar to partitioned tables. Instead to populate the bucketed tables we need to use INSERT OVERWRITE TABLE … SELECT …FROM clause from another table. Hence, we will create one temporary table in hive with all the columns in input file from that table we will copy into our target bucketed table for this.Moreover, let’s suppose we have created the temp_user temporary table. Further, for populating the bucketed table with the temp_user table below is the HiveQL.Let’s describe What is HiveQL SELECT Statement In addition, we need to set the property hive.enforce.bucketing = true, so that Hive knows to create the number of buckets declared in the table definition to populate the bucketed table.

Some points are important to Note:i. However, in partitioning the property hive.enforce.bucketing = true is similar to hive.exec.dynamic.partition=true property. So, we can enable dynamic bucketing while loading data into hive table By setting this property.ii. Moreover, it will automatically set the number of reduce tasks to be equal to the number of buckets mentioned in the table definition (for example 32 in our case). Further, it automatically selects the clustered by column from table definition.iii. Also, we have to manually convey the same information to Hive that, number of reduce tasks to be run (for example in our case, by using set mapred.reduce.tasks=32) and CLUSTER BY (state) and SORT BY (city) clause in the above INSERT …Statement at the end since we do not set this property in Hive Session.Let’s read about Apache Hive View and Hive Index

c. Solution For Example Use Case

Along with script required for temporary hive table creation, Below is the combined HiveQL. However, let’s save this HiveQL into bucketed_user_creation.hql. Also, save the input file provided for example use case section into the user_table.txt file in home directory.

8. Conclusion

As a result, we have seen the whole concept of Hive Bucketing. Also, it includes why even we need Hive Bucketing after Hive Partitioning Concept, Features of Bucketing in Hive, Advantages of Bucketing in Hive, Limitations of Bucketing in Hive, And Example Use Case of Bucketing in Hive. As a result we seen Hive Bucketing Without Partition, how to decide number of buckets in hive, hive bucketing with examples, and hive insert into bucketed table.Still, if any doubt occurred feel free to ask in the comment section.
Related Topic- Hive OperatorsFor reference