The readme file says Hive 13 is required, but the published report compares results from Hive 10, so I hope at least some of the queries will work on older versions of Hive.

Our cluster is currently running HDP 2.0.6 with Hive 0.12.0. Can I run any of the queries on Hive 12? If I have to skip some of the queries not supported on older versions of Hive, that’s fine. But if none of the queries will run on Hive 12, I won’t waste time trying. If none of the queries in the testbench will run on Hive 12, how were queries run on Hive 10 for the published report?

Carter, thanks for your response. I’m sure that if I get to the point of scaling this up, this information will be very useful. At this point, I’m still trying to get things to run at any scale (I’m starting with a scale of 10).
tpch-setup.sh works, but tpcds-setup.sh has failed in several ways. Then, when I try to run a query, I get:
# cd sample-queries-tpch
# hive -i testbench.settings
hive> use tpch_bin_partitioned_orc_100;
hive.exec.pre.hooks Class not found:org.apache.hadoop.hive.ql.hooks.ATSHook
FAILED: Hive Internal Error: java.lang.ClassNotFoundException(org.apache.hadoop.hive...

Hank, it’s not strictly true that you need Hive 13 to run the benchmark. The problem is that large-scale data generation is extremely difficult without Hive 13, so the benchmark tries to push you toward using Hive 13. If you generate data at any meaningful scale (1 TB+) with Hive 10 or 12, it will crash and be very difficult to tune around. I’ve seen people generate smaller datasets (100 GB or so) without problems.

I had a customer go through this a few weeks ago. What they ended up doing was comparing the performance of Hive 12 using textfile against Hive 13 using textfile, and then against Hive 13 using ORCFile. All of the data was generated in Hive 13. If you want to test at 1 TB+, you should go this way. You can install a Hive 13 package on your cluster without interfering with Hive 12 and use it to generate the data.

As an example, to generate 1 TB of text data, use 'FORMAT=textfile ./tpcds-setup.sh 1000'. The database that is generated can be queried from both Hive 12 and Hive 13. When using textfile, you won’t get the benefits of vectorization or ORCFile, but you will get the benefit of Tez. In addition, you could compare both against Hive 13 using ORCFile / vectorization to see the added benefit.
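
The workflow described here could be sketched roughly as the shell script below. Treat it as a sketch under assumptions, not something taken from the testbench: the two Hive client paths, the database name, and the query file are hypothetical placeholders, and the data generation step assumes it is run from an environment where the Hive 13 client is the active one. It prints the commands by default (DRY_RUN=1) so they can be checked before anything touches the cluster.

```shell
#!/bin/sh
# Sketch only: generate text data once, then run the same query from both Hive versions.
# Every path and name below is a hypothetical placeholder; adjust for your cluster.

SCALE="${SCALE:-1000}"                        # scale factor; 1000 is about 1 TB
HIVE12="${HIVE12:-/opt/hive-0.12/bin/hive}"   # hypothetical Hive 12 client path
HIVE13="${HIVE13:-/opt/hive-0.13/bin/hive}"   # hypothetical Hive 13 client path
DB="${DB:-my_tpcds_text_db}"                  # placeholder: use the database the setup script created
QUERY="${QUERY:-my_query.sql}"                # placeholder: one of the sample queries
DRY_RUN="${DRY_RUN:-1}"                       # 1 = only print commands, anything else = execute

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Generate the dataset as text files (assumes Hive 13 is the active client).
generate_text_data() {
  run env FORMAT=textfile ./tpcds-setup.sh "$SCALE"
}

# Run the query file with a given hive client; $1 is the path to that client.
run_query_with() {
  run "$1" -i testbench.settings --database "$DB" -f "$QUERY"
}

generate_text_data
run_query_with "$HIVE12"
run_query_with "$HIVE13"
```

To actually run it, set DRY_RUN=0 and override the placeholders, for example 'DRY_RUN=0 DB=<your database> QUERY=<your query file> sh compare.sh', where compare.sh is whatever name you saved the script under. Wrapping each query run in 'time' would give a rough comparison between the two versions.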