
Use Hive Partitions to Read/Write with Subfolders

We all know that Hive reads and writes data at the folder level. The limitation is that, by default, it only reads files from, and writes files to, the exact folder you specify. But sometimes the input data is organized into subfolders, and Hive cannot read it if you only point it at the root folder; or you may want to write the output into separate folders instead of putting everything in one place.

For example, suppose we have sales data dumped to HDFS (or S3) with a path structure like sales/city=BEIJING/day=20140401/data.tsv. As you can see, the data is partitioned by city and day. We could copy every data.tsv into the same folder, but we would have to rename the files to avoid conflicts, which is painful when the files are numerous and large. And even if we did copy all the data.tsv files into one folder, we would still want the output separated into different folders by city and day. How do we do that?

Can Hive be smart enough to read all the subfolders' data and write the output to separate folders? The answer is yes.

Let’s use an example to see how to do this.

Sample Data:

In the following example, given the SALES data and the PRODUCT_COSTS data, we compute the TOTAL PROFIT of each product in each city on each day.

SALES data

Format: ID SALE_CITY PRODUCT_NAME QUANTITY SALE_PRICE SNAPSHOT_DAY

PRODUCT_COSTS data

Format: ID PRODUCT_NAME PRODUCT_COST SNAPSHOT_DAY

PROFIT OUTPUT

The profit of each record is: (SALE_PRICE - PRODUCT_COST) * QUANTITY
Format: PRODUCT_NAME QUANTITY TOTAL_PROFIT
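As a quick sanity check of the formula, here is a hypothetical record with made-up numbers (not from the actual data sets): 3 units sold at a price of 10.0, with a product cost of 6.0 per unit.

```sql
-- Hypothetical numbers, purely for illustration:
-- 3 units sold at 10.0 each, product cost 6.0 per unit
SELECT (10.0 - 6.0) * 3 AS profit;  -- 12.0
```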

1. Hive configuration

First, we need to enable Hive's dynamic partition functionality by putting the following two lines at the top of the Hive script:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

2. Input part

For the input, we only need to set the table's LOCATION to the root folder and declare the partition columns when we create the table.
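A minimal sketch of such a table definition (the table name, column names, and paths here are assumptions for illustration, not the author's actual script): an EXTERNAL table whose LOCATION is the root folder, PARTITIONED BY the subfolder keys. Note that Hive does not discover the city=.../day=... subfolders by itself; the partitions must be registered, e.g. with ALTER TABLE ... ADD PARTITION or MSCK REPAIR TABLE.

```sql
-- Sketch only: table/column names and the location are illustrative assumptions.
CREATE EXTERNAL TABLE sales (
  id           STRING,
  sale_city    STRING,
  product_name STRING,
  quantity     INT,
  sale_price   DOUBLE,
  snapshot_day STRING
)
PARTITIONED BY (city STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/input/sales';

-- Register the existing city=.../day=... subfolders as partitions,
-- either one by one:
ALTER TABLE sales ADD PARTITION (city='BEIJING', day='20140401');
-- or all at once:
MSCK REPAIR TABLE sales;
```

Once the partitions are registered, a query against the sales table reads all the subfolders' files transparently.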

3. Output part

For the output, two things are needed: (1) add PARTITION (CITY, DAY) after the INSERT OVERWRITE TABLE xxx statement; (2) append SALE_CITY and SNAPSHOT_DAY at the end of each output record, in the order the partition columns are declared.

By doing this, Hive will split the data by partition and write each split to a separate folder, named using the partition rule (the xxx=yyy pattern). Here it will output to /output/city=BEIJING/day=20140401 …
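Putting the two pieces together, the output step might look like the following sketch (the sales_profit and product_costs table names and the join columns are assumptions about the schema, not the author's exact query): the PARTITION clause names the dynamic partition columns, and the SELECT list ends with sale_city and snapshot_day in the same order.

```sql
-- Sketch only: table names and join columns are illustrative assumptions.
INSERT OVERWRITE TABLE sales_profit
PARTITION (city, day)
SELECT
  s.product_name,
  SUM(s.quantity)                                   AS quantity,
  SUM((s.sale_price - c.product_cost) * s.quantity) AS total_profit,
  s.sale_city,     -- fills the dynamic partition column "city"
  s.snapshot_day   -- fills the dynamic partition column "day"
FROM sales s
JOIN product_costs c
  ON  s.product_name = c.product_name
  AND s.snapshot_day = c.snapshot_day
GROUP BY s.product_name, s.sale_city, s.snapshot_day;
```

With hive.exec.dynamic.partition.mode=nonstrict set as above, both partition columns may be filled dynamically; in the default strict mode, at least one partition column would have to be given a static value.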

Here is the sample code. I tested it on my local environment and on an AWS EMR cluster, and it produced the correct output.