Using Apache Hive with HDCloud

Accessing Amazon S3 Data in Hive

Exposing Data in Amazon S3 as Hive Tables

Datasets stored in Amazon S3 can easily be made available in Hive as managed or external tables. The main difference between the two table types is that data referenced by an external table is not deleted when the table is dropped.
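For example, the inventory dataset could be exposed as an external table with a statement like the following (the column list, partition column, and bucket name are illustrative, not prescriptive):

```sql
CREATE EXTERNAL TABLE inventory (
  inv_item_sk          BIGINT,
  inv_warehouse_sk     BIGINT,
  inv_quantity_on_hand INT
)
PARTITIONED BY (inv_date_sk BIGINT)
STORED AS ORC
LOCATION 's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';
```

Because the table is declared EXTERNAL, DROP TABLE inventory removes only the Metastore entry; the ORC files in Amazon S3 remain in place.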

In this case, dropping the “inventory” table does not delete the contents of s3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory. It drops only the table definition from the Metastore.

Therefore, external tables are optimal when the data is already present in Amazon S3: once populated, the data can be accessed repeatedly, and it survives even an accidental DROP TABLE.

Populating Partition-Related Information

When working with data on Amazon S3, the steps for populating partition-related information are the same as when working with data in HDFS.

Creating a table definition does not by itself populate partition-related information in the Metastore. When a dataset in Amazon S3 is already partitioned, you must run the MSCK command to populate the partition-related information into the Metastore.

A CREATE TABLE statement registers only the table definition in the Metastore; it does not populate the partition-related information. For the inventory table, you would populate it by running MSCK REPAIR TABLE inventory.
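Continuing the inventory example, the sequence might look like the following (the thread-count value is illustrative):

```sql
-- Optionally raise the partition-scanning parallelism first:
SET hive.metastore.fshandler.threads=30;

-- Discover the partitions already present under the table's S3 location:
MSCK REPAIR TABLE inventory;

-- Confirm that the partitions were registered in the Metastore:
SHOW PARTITIONS inventory;
```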

You can use the hive.metastore.fshandler.threads parameter (default is 15) to increase the number of threads used for scanning the partitions in the MSCK phase.


Analyzing Tables

When working with data on Amazon S3, the steps for analyzing tables are the same as when working with data in HDFS.

Table statistics can be gathered automatically by setting hive.stats.autogather=true, or on demand by running an ANALYZE TABLE ... COMPUTE STATISTICS statement.
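For example, for an illustrative table named test:

```sql
-- Gather statistics automatically as data is written:
SET hive.stats.autogather=true;

-- Or compute table-level statistics on demand:
ANALYZE TABLE test COMPUTE STATISTICS;

-- Column-level statistics can be gathered as well:
ANALYZE TABLE test COMPUTE STATISTICS FOR COLUMNS;
```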

Improving Performance for Hive Jobs

General Performance Tips

As S3 has no concept of file permissions, set hive.warehouse.subdir.inherit.perms=false to reduce the number of file permission checks.

Similarly, as S3 has no concept of directory permissions, set hive.metastore.pre.event.listeners to an empty value to reduce the number of directory permission checks in S3.

You can set these parameters in hive-site.xml.
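A minimal hive-site.xml fragment applying both settings might look like this (a sketch, not a complete configuration file):

```xml
<property>
  <name>hive.warehouse.subdir.inherit.perms</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value></value>
</property>
```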

Reading ORC Format in Hive

Tune these parameters to improve the performance of jobs that read the ORC format:

hive.orc.compute.splits.num.threads

If you are using the ORC format and want to improve split computation time, increase hive.orc.compute.splits.num.threads (default is 10). This parameter controls the number of parallel threads involved in computing splits. Note that split computation for Parquet is still single-threaded, so it can take longer with Parquet and S3.

hive.orc.splits.include.file.footer

If you are using the ORC format with the ETL file split strategy, set hive.orc.splits.include.file.footer=true to piggyback the file footer information in the split payload.

You can set these parameters using the --hiveconf option in the Hive CLI or using the SET command in Beeline.
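For example, in Beeline (the thread-count value is illustrative):

```sql
SET hive.orc.compute.splits.num.threads=25;
SET hive.orc.splits.include.file.footer=true;
-- Equivalently, pass each setting to the Hive CLI with
-- --hiveconf hive.orc.compute.splits.num.threads=25
```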

Accelerating ETL Jobs

Tune these parameters to improve the performance of ETL jobs:

hive.stats.fetch.partition.stats

Query launches can be slightly slower when no statistics are available or when hive.stats.fetch.partition.stats=false. In such cases, Hive ends up looking at the file size of every file it tries to access. Tuning hive.metastore.fshandler.threads helps reduce the overall time taken for this Metastore operation.

fs.trash.interval

Because DROP TABLE can be slow on S3 (moving files to the trash involves a copy plus a delete), you can set fs.trash.interval=0 to skip the trash altogether.

You can set these parameters using the --hiveconf option in the Hive CLI or using the SET command in Beeline.

Accelerating Inserts in Hive

When inserting data, Hive moves data from a temporary folder to the final location. This move is actually a copy-and-delete action, which is expensive in Amazon S3: the more data being written out to S3, the more expensive the operation.
To accelerate the process, you can increase hive.mv.files.thread (default is 15), depending on the size of your dataset. You can set it in hive-site.xml.
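For example, a hive-site.xml entry raising the thread count might look like this (the value 30 is illustrative):

```xml
<property>
  <name>hive.mv.files.thread</name>
  <value>30</value>
</property>
```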