Note: The HiveVectorizedORC profile does not support the timestamp data type.

Complex Data Types

Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to text. While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.

Examples using complex data types with the Hive and HiveORC profiles are provided later in this topic.

Note: The HiveVectorizedORC profile does not support complex types.

Sample Data Set

Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:

Field Name          Data Type
----------          ---------
location            text
month               text
number_of_orders    integer
total_sales         double

Prepare the sample data set for use:

First, create a text file:

$ vi /tmp/pxf_hive_datafile.txt

Add the following data to pxf_hive_datafile.txt; notice the use of the comma (,) to separate the four field values:
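(The original data rows are not reproduced here; the values below are illustrative and match the location, month, number_of_orders, and total_sales field layout.)

Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
San Francisco,Sept,156,6846.34
Paris,Nov,159,7134.56

Next, create a Hive table named sales_info over this data. The DDL below is a sketch reconstructed from the explanation that follows:

hive> CREATE TABLE sales_info (location string, month string,
        number_of_orders int, total_sales double)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS textfile;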

The STORED AS textfile subclause instructs Hive to create the table in Textfile (the default) format. Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.

The DELIMITED FIELDS TERMINATED BY subclause identifies the field delimiter within a data record (line). The sales_info table field delimiter is a comma (,).

Load the pxf_hive_datafile.txt sample data file into the sales_info table you just created:
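(A minimal sketch of the load step, using the file and table names from the previous steps.)

hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt'
        INTO TABLE sales_info;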

In examples later in this section, you will access the sales_info Hive table directly via PXF. You will also insert sales_info data into tables of other Hive file format types, and use PXF to access those directly as well.

Perform a query on sales_info to verify that the data was loaded successfully:

hive> SELECT * FROM sales_info;

Determine the HDFS Location of a Hive Table

Should you need to identify the HDFS file location of a Hive managed table, reference it using its HDFS file path. You can determine a Hive table’s location in HDFS using the DESCRIBE command, for example:
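(The sketch below uses the sales_info table created earlier; look for the location: value in the Detailed Table Information that Hive returns.)

hive> DESCRIBE EXTENDED sales_info;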

Using PXF and HCatalog to Query Hive

You can query Hive tables directly through HCatalog integration with HAWQ and PXF, regardless of the underlying file storage format. This integration allows HAWQ to directly use table metadata stored in HCatalog.

HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. This provides several advantages:

You do not need to know the table schema of your Hive tables

You do not need to manually enter information about Hive table location or format

If Hive table metadata changes, HCatalog provides updated metadata. This is in contrast to the use of static external PXF tables to define Hive table metadata for HAWQ.

HAWQ integrates with HCatalog to query Hive tables as follows:

HAWQ retrieves table metadata from HCatalog using PXF.

HAWQ creates in-memory catalog tables from the retrieved metadata. If a table is referenced multiple times in a transaction, HAWQ uses its in-memory metadata to reduce external calls to HCatalog.

PXF queries Hive using table metadata that is stored in the HAWQ in-memory catalog tables. Table metadata is dropped at the end of the transaction.

Enabling HCatalog Integration

To enable HCatalog query integration in HAWQ, perform the following steps:

Make sure your deployment meets the requirements listed in Prerequisites.

If necessary, set the pxf_service_address global configuration property to the hostname or IP address and port where you have installed the PXF Hive plug-in (preferably the HDFS NameNode). By default, the value is set to localhost:51200. For example:
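(The hostname hivenode below is an assumption for illustration.)

postgres=> SET pxf_service_address TO 'hivenode:51200';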

It is not recommended to create a HAWQ table using the WITH (OIDS) clause. If any user tables were created using the WITH (OIDS) clause, additional operations are required to enable HCatalog integration. To access a Hive table via HCatalog when user tables were created using WITH (OIDS), HAWQ users must have SELECT permission to query every user table within the same schema that was created using the WITH (OIDS) clause.

Determine which user tables were created using the WITH (OIDS) clause:
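(A sketch against the system catalogs; adjust the namespace filter for your deployment.)

postgres=> SELECT oid, relname
           FROM pg_class
           WHERE relhasoids = true
             AND relnamespace NOT IN (SELECT oid FROM pg_namespace
                                      WHERE nspname IN ('pg_catalog', 'information_schema'));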

To obtain a description of a Hive table with HCatalog integration, you can use the psql client interface.

Within HAWQ, use either the \d hcatalog.hive-db-name.hive-table-name or \d+ hcatalog.hive-db-name.hive-table-name commands to describe a single table. \d displays only HAWQ’s interpretation of the underlying source (Hive in this case) data type, while \d+ displays both the HAWQ interpreted and Hive source data types. For example, from the psql client interface:
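The following sketch describes the sales_info Hive table created earlier; the output shown is illustrative, and column details depend on your Hive table:

postgres=> \d hcatalog.default.sales_info

PXF Hive Table "default.sales_info"
      Column      |  Type
------------------+--------
 location         | text
 month            | text
 number_of_orders | int4
 total_sales      | float8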

Use \d hcatalog.hive-db-name.* to describe the whole database schema, i.e. all tables in hive-db-name.

Use \d hcatalog.*.* to describe the whole schema, i.e. all databases and tables.

When using \d or \d+ commands in the psql HAWQ client, hcatalog will not be listed as a database. If you use other psql-compatible clients, hcatalog will be listed as a database with a size value of -1, since hcatalog is not a real database in HAWQ.

Alternatively, you can use the pxf_get_item_fields user-defined function (UDF) to obtain Hive table descriptions from other client interfaces or third-party applications. The UDF takes a PXF profile and a table pattern string as its input parameters. Note: The only supported input profile at this time is 'Hive'.

The following statement returns a description of a specific table. The description includes path, itemname (table), fieldname, fieldtype (HAWQ type), and sourcefieldtype (Hive type).
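(A sketch using the sales_info table created earlier; recall that 'Hive' is the only supported input profile.)

postgres=> SELECT * FROM pxf_get_item_fields('Hive', 'default.sales_info');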

Limitations

Even for primitive types, HCatalog metadata descriptions produced by \d are HAWQ’s interpretation of the underlying Hive data types. For example, the Hive type tinyint is converted to HAWQ type int2. (See Data Type Mapping.)

HAWQ reserves the database name hcatalog for system use. You cannot connect to or alter the system hcatalog database.

Querying External Hive Data

In the previous section, you used HCatalog integration to query a Hive table. You can also create a PXF/HAWQ external table to access Hive table data. This Hive table access mechanism requires that you identify an appropriate Hive profile.

The PXF Hive plug-in supports several Hive-related profiles: Hive, HiveText, HiveRC, HiveORC, and HiveVectorizedORC. The HiveText and HiveRC profiles are specifically optimized for text and RC file formats, respectively. The HiveORC and HiveVectorizedORC profiles are optimized for ORC file formats. The Hive profile is optimized for all file storage types; use the Hive profile when the underlying Hive table is composed of multiple partitions with differing file formats.

Use the following syntax to create a HAWQ external table representing Hive data:
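The general form is sketched below (reconstructed to match the keyword descriptions that follow; treat it as a sketch rather than the definitive grammar):

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    ?PROFILE=Hive|HiveText|HiveRC|HiveORC|HiveVectorizedORC[&DELIMITER=<delim>]')
FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')

Hive-plug-in-specific keywords and values used in this CREATE EXTERNAL TABLE call are described below.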

<host>

The PXF host. While <host> may identify any PXF agent node, use the HDFS NameNode as it is guaranteed to be available in a running HDFS cluster. If HDFS High Availability is enabled, <host> must identify the HDFS NameService.

<port>

The PXF port. If <port> is omitted, PXF assumes <host> identifies a High Availability HDFS Nameservice and connects to the port number designated by the pxf_service_port server configuration parameter value. Default is 51200.

<hive-db-name>

The name of the Hive database. If omitted, defaults to the Hive database named default.

<hive-table-name>

The name of the Hive table.

PROFILE

The PROFILE keyword must specify one of the values Hive, HiveText, HiveRC, HiveORC, or HiveVectorizedORC.

DELIMITER

The DELIMITER clause is required for both the HiveText and HiveRC profiles and identifies the field delimiter used in the Hive data set. <delim> must be a single ASCII character or be specified in hexadecimal representation.

FORMAT (Hive, HiveORC, and HiveVectorizedORC profiles)

The FORMAT clause must specify CUSTOM. The CUSTOM format supports only the built-in pxfwritable_import formatter.

FORMAT (HiveText and HiveRC profiles)

The FORMAT clause must specify TEXT. The delimiter must be specified a second time in '<delim>'.

Hive Profile

Use the Hive profile with any Hive file storage format. With the Hive profile, you can also access heterogeneous format data in a single table where each partition may be stored in a different file format. In both cases, the Hive profile uses the optimal Hive* profile for the underlying file storage type. Refer to the Partition Filtering discussion later in this topic for additional information on partitioning and the Hive profile.

Example: Using the Hive Profile

Use the Hive profile to create a readable HAWQ external table from the Hive sales_info textfile format table created earlier.
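The following definition is a minimal sketch, assuming a PXF host named namenode listening on the default port 51200:

postgres=> CREATE EXTERNAL TABLE salesinfo_hiveprofile (location text, month text,
             num_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
           FORMAT 'CUSTOM' (formatter='pxfwritable_import');

postgres=> SELECT * FROM salesinfo_hiveprofile;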

Next, use the HiveText profile to access the same data. The HiveText profile requires that you specify the field delimiter in both the LOCATION and FORMAT clauses. (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
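The following definition is a sketch, with host and table names carried over from the earlier steps; it is the definition the notes below refer to:

postgres=> CREATE EXTERNAL TABLE salesinfo_hivetextprofile (location text, month text,
             num_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
           FORMAT 'TEXT' (delimiter=E',');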

Notice that:

The LOCATION subclause DELIMITER value is specified in hexadecimal format. \x is a prefix that instructs PXF to interpret the following characters as hexadecimal. 2c is the hex value for the comma character.

The FORMAT subclause delimiter value is specified as the single ASCII comma character ','. The E prefix marks the delimiter as an escape string literal.
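To produce the RCFile copy of the sample data referenced below, you might create a second Hive table and populate it from sales_info (a sketch; the original example's exact statements are assumed):

hive> CREATE TABLE sales_info_rcfile (location string, month string,
        number_of_orders int, total_sales double)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS rcfile;

hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;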

A copy of the sample data set is now stored in RCFile format in sales_info_rcfile.

Perform a Hive query on sales_info_rcfile to verify that the data was loaded successfully:

hive> SELECT * FROM sales_info_rcfile;

Use the PXF HiveRC profile to create a readable HAWQ external table from the Hive sales_info_rcfile table created in the previous step. You must specify a delimiter option in both the LOCATION and FORMAT clauses:
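A sketch of the external table definition, assuming the same namenode PXF host:

postgres=> CREATE EXTERNAL TABLE salesinfo_hivercprofile (location text, month text,
             num_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
           FORMAT 'TEXT' (delimiter=E',');

postgres=> SELECT location, total_sales FROM salesinfo_hivercprofile;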

ORC File Format

The Optimized Row Columnar (ORC) file format is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC format offers improvements over text and RCFile formats in terms of both compression and performance. HAWQ/PXF supports ORC version 1.2.1.

ORC is type-aware and specifically designed for Hadoop workloads. ORC files store both type and encoding information for the data in the file. All columns within a single group of row data (also known as a stripe) are stored together on disk in ORC format files. The columnar nature of the ORC format enables read projection, helping avoid accessing unnecessary columns during a query.

ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
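To produce the ORC copy of the sample data referenced below, you might run (a sketch; the original example's exact statements are assumed):

hive> CREATE TABLE sales_info_ORC (location string, month string,
        number_of_orders int, total_sales double)
        STORED AS ORC;

hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;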

A copy of the sample data set is now stored in ORC format in sales_info_ORC.

Perform a Hive query on sales_info_ORC to verify that the data was loaded successfully:

hive> SELECT * FROM sales_info_ORC;

Start the psql subsystem and turn on timing:

$ psql -d postgres

postgres=> \timing
Timing is on.

Use the PXF HiveORC profile to create a readable HAWQ external table from the Hive table named sales_info_ORC that you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.
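A sketch, again assuming the namenode PXF host:

postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile (location text, month text,
             num_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
           FORMAT 'CUSTOM' (formatter='pxfwritable_import');

postgres=> SELECT * FROM salesinfo_hiveORCprofile;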

Example: Using the HiveVectorizedORC Profile

In the following example, you will use the HiveVectorizedORC profile to query the sales_info_ORC Hive table you created in the previous example.

Note: The HiveVectorizedORC profile does not support the timestamp data type or complex types.

Start the psql subsystem:

$ psql -d postgres

Use the PXF HiveVectorizedORC profile to create a readable HAWQ external table from the Hive table named sales_info_ORC that you created in Step 1 of the previous example. The FORMAT clause must specify 'CUSTOM'. The HiveVectorizedORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.
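A sketch, using the same assumed host:

postgres=> CREATE EXTERNAL TABLE salesinfo_hiveVectORC (location text, month text,
             num_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveVectorizedORC')
           FORMAT 'CUSTOM' (formatter='pxfwritable_import');

postgres=> SELECT * FROM salesinfo_hiveVectORC;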

Accessing Parquet-Format Hive Tables

The PXF Hive profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
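(The table and column names below are illustrative.)

hive> CREATE TABLE hive_parquet_table (fname string, lname string,
        custid int, acctbalance double)
        STORED AS parquet;

you might define the corresponding HAWQ external table with equivalent HAWQ data types:

postgres=> CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text,
             custid int, acctbalance float8)
           LOCATION ('pxf://namenode:51200/default.hive_parquet_table?PROFILE=Hive')
           FORMAT 'CUSTOM' (formatter='pxfwritable_import');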

Example: Using the HiveORC Profile with Complex Data Types

In the following example, you will create a Hive table named table_complextypes_ORC, stored in ORC format. You will then use the HiveORC profile to query the complex types in this table, which is populated from the table_complextypes Hive table you created in a previous exercise.
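A sketch of the step that creates the ORC copy; the column types (ARRAY<int> and MAP<string,string> for the complex fields) are assumptions carried over from the earlier complex-types exercise:

hive> CREATE TABLE table_complextypes_ORC (index int, name string,
        intarray ARRAY<int>, propmap MAP<string,string>)
        STORED AS ORC;

hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;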

Use the PXF HiveORC profile to create a readable HAWQ external table from the Hive table named table_complextypes_ORC you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.
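A sketch of this step; each complex type is mapped to text, and the column list matches the assumed Hive schema above:

postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc (index int, name text,
             intarray text, propmap text)
           LOCATION ('pxf://namenode:51200/default.table_complextypes_ORC?PROFILE=HiveORC')
           FORMAT 'CUSTOM' (formatter='pxfwritable_import');

postgres=> SELECT * FROM complextypes_hiveorc;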

Partition Filtering

The PXF Hive plug-in supports the Hive partitioning feature and directory structure. This enables partition exclusion on selected HDFS files comprising the Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a PXF query using a WHERE clause that refers to a specific partition in the partitioned Hive table.

To take advantage of PXF partition filtering push-down, the Hive and PXF partition field names should be the same. Otherwise, PXF ignores partition filtering and the filtering is performed on the HAWQ side, impacting performance.

Note: The Hive plug-in filters only on partition columns, not on other table attributes.

Example: Using the Hive Profile to Access Partitioned Homogeneous Data

In this example, you will use the Hive profile to query a Hive table named sales_part that you partition on the delivery_state and delivery_city fields. You will then create a HAWQ external table to query sales_part, including specific examples illustrating filter push-down.

Create a Hive table named sales_part with two partition columns, delivery_state and delivery_city:
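A sketch of the DDL; the non-partition columns (item_name, price) are assumptions consistent with the queries discussed below:

hive> CREATE TABLE sales_part (item_name string, price double)
        PARTITIONED BY (delivery_state string, delivery_city string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';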

Create a PXF external table to read the partitioned sales_part Hive table. To take advantage of partition filter push-down, define fields corresponding to the Hive partition fields at the end of the CREATE EXTERNAL TABLE attribute list.
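A sketch of the external table definition, with the partition fields last:

postgres=> CREATE EXTERNAL TABLE pxf_sales_part (item_name text, price float8,
             delivery_state text, delivery_city text)
           LOCATION ('pxf://namenode:51200/sales_part?PROFILE=Hive')
           FORMAT 'CUSTOM' (formatter='pxfwritable_import');

Now query the table with a filter on both a partition column and a regular column (the item_name value is illustrative):

postgres=> SELECT * FROM pxf_sales_part
             WHERE delivery_city = 'Sacramento' AND item_name = 'cube';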

This query filters on the delivery_city partition value Sacramento. The filter on item_name is not pushed down, since it is not a partition column; it is applied on the HAWQ side after all the data in the Sacramento partition is transferred for processing.

Query (with pushdown) for all records where delivery_state is CALIFORNIA:
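Using the pxf_sales_part definition sketched above:

postgres=> SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';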

This query reads all of the data in the CALIFORNIA delivery_state partition, regardless of the city.

Example: Using the Hive Profile to Access Partitioned Heterogeneous Data

The Hive profile supports multiple data format types. This support enables you to query a partitioned Hive table that may be composed of data of different formats.

In this example, you will use the Hive profile both directly and indirectly via PXF HCatalog integration to query a partitioned Hive external table. The table is composed of the HDFS data files associated with the sales_info (text format) and sales_info_rcfile (RC format) Hive tables you created in previous exercises. You will partition the data by year, assigning the data from sales_info to the year 2013, and the data from sales_info_rcfile to the year 2016. (Ignore at the moment the fact that the tables contain the same data.)

Create a Hive external table named hive_multiformpart that is partitioned by a string field named year:
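A sketch of the DDL, reusing the sample data set's column layout; the warehouse paths below are assumptions (verify them with DESCRIBE EXTENDED as shown earlier):

hive> CREATE EXTERNAL TABLE hive_multiformpart (location string, month string,
        number_of_orders int, total_sales double)
        PARTITIONED BY (year string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2013')
        LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info';

hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2016')
        LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info_rcfile';

hive> ALTER TABLE hive_multiformpart PARTITION (year = '2016') SET FILEFORMAT RCFILE;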

Using PXF with Hive Default Partitions

This topic describes a difference in query results between Hive and PXF queries when Hive tables use a default partition. When dynamic partitioning is enabled in Hive, a partitioned table may store data in a default partition. Hive creates a default partition when the value of a partitioning column does not match the defined type of the column (for example, when a NULL value is used for any partitioning column). In Hive, any query that includes a filter on a partition column excludes any data that is stored in the table’s default partition.

Similar to Hive, PXF represents a table’s partitioning columns as columns that are appended to the end of the table. However, PXF translates any column value in a default partition to a NULL value. This means that a HAWQ query that includes an IS NULL filter on a partitioning column can return different results than the same Hive query.
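For example, consider a Hive table named sales partitioned by a date column xdate. The DDL and rows below are a sketch (the original example's exact statements and values are assumed); row 4 supplies a NULL partition value:

hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> CREATE TABLE sales (order_id bigint, order_amount float)
        PARTITIONED BY (xdate date);
hive> INSERT INTO TABLE sales PARTITION (xdate) VALUES
        (1, 10.0, '2017-03-01'),
        (2, 20.0, '2017-03-02'),
        (3, 30.0, '2017-03-03'),
        (4, 40.0, NULL);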

The insertion of row 4 creates a Hive default partition, because the partition column xdate contains a null value.

In Hive, any query that filters on the partition column omits data in the default partition. For example, the following query returns no rows:

hive> SELECT * FROM sales WHERE xdate IS NULL;

However, if you map this table as a PXF external table in HAWQ, all default partition values are translated into actual NULL values. In HAWQ, executing the same query against the PXF table returns row 4 as the result, because the filter matches the NULL value.

Keep this behavior in mind when executing IS NULL queries on Hive partitioned tables.