Exercise: Create and Query PXF External Tables

Perform the following steps to create HAWQ external table definitions to read the dimension data you previously loaded into HDFS.

Log in to the HAWQ master node as the gpadmin user:

$ ssh gpadmin@<master>

Navigate to the PXF script directory:

gpadmin@master$ cd $HAWQGSBASE/tutorials/getstart/pxf

Start the psql subsystem:

gpadmin@master$ psql
hawqgsdb=#

Create a HAWQ external table definition to represent the Retail demo customers_dim dimension data you loaded into HDFS in Lesson 4; substitute your NameNode hostname or IP address in the <namenode> field of the LOCATION clause.

The LOCATION clause of a CREATE EXTERNAL TABLE statement specifying the pxf protocol must include:

The hostname or IP address of your HAWQ cluster’s HDFS <namenode>.

The location and/or name of the external data source. You specified the HDFS file path to the customers_dim data file above.

The PXF profile to use to access the external data. The PXF HDFS plug-in supports the HdfsTextSimple profile to access delimited text format data.

The FORMAT clause of a CREATE EXTERNAL TABLE statement specifying the pxf protocol and HdfsTextSimple profile must identify TEXT format and include the DELIMITER character used to access the external data source. You identified a tab delimiter character above.
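Putting these clauses together, a minimal sketch of the statement follows. The column list and HDFS file path are assumptions based on the Retail demo data, not verbatim from this tutorial; replace <namenode> with your NameNode host, and confirm the PXF port for your cluster (51200 is a common default):

```sql
-- Sketch only: column names/types and the HDFS path are assumed;
-- adjust them to match the customers_dim data you actually loaded.
CREATE EXTERNAL TABLE customers_dim_pxf
  (customer_id TEXT, first_name TEXT, last_name TEXT, gender TEXT)
LOCATION ('pxf://<namenode>:51200/retail_demo/customers_dim/customers_dim.tsv.gz?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER E'\t');
```

The HdfsTextSimple profile in the LOCATION clause and the tab DELIMITER in the FORMAT clause correspond to the delimited text format you identified above.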

The create_pxf_tables.sql SQL script creates HAWQ external table definitions for the remainder of the Retail dimension data. In another terminal window, edit create_pxf_tables.sql, replacing each occurrence of NAMENODE with the hostname or IP address you specified in the previous step.
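For example, you could perform the replacement with sed rather than a hand edit. The hostname namenode.example.com below is a placeholder for your own NameNode; this demo pipes a sample line through sed, while in the tutorial you would run `sed -i 's/NAMENODE/<your-host>/g' create_pxf_tables.sql` against the script itself:

```shell
# Demonstrate the NAMENODE substitution on one sample LOCATION line;
# "namenode.example.com" is a hypothetical hostname.
echo "LOCATION ('pxf://NAMENODE:51200/retail_demo/customers_dim')" |
  sed 's/NAMENODE/namenode.example.com/g'
```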

Run the create_pxf_tables.sql SQL script to create the remainder of the HAWQ external table definitions, then exit the psql subsystem:

hawqgsdb=# \i create_pxf_tables.sql
hawqgsdb=# \q

Note: The create_pxf_tables.sql script deletes each external table before attempting to create it. If this is your first time performing this exercise, you can safely ignore the psql “table does not exist, skipping” messages.

Verify that you successfully created the external table definitions.
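One way to check is to query the catalog from psql. This sketch assumes the external table names all end in _pxf, following this tutorial's naming convention:

```sql
-- List relations whose names end in _pxf (naming convention assumed);
-- each external table you created should appear in the result.
SELECT relname FROM pg_class
 WHERE relname LIKE '%_pxf'
 ORDER BY relname;
```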

Exercise: Query HAWQ and PXF Tables

Often, data will reside in both HAWQ tables and external data sources. In these instances, you can use both HAWQ internal and PXF external tables to relate and query the data.

Perform the following steps to identify the names and email addresses of all customers who made gift certificate purchases, providing an overall order total for such purchases. The orders fact data resides in a HAWQ-managed table and the customers data resides in HDFS.

Start the psql subsystem:

gpadmin@master$ psql
hawqgsdb=#

The orders fact data is accessible via the orders_hawq table created in the previous lesson. The customers data is accessible via the customers_dim_pxf table created in the previous exercise. Using these internal and external HAWQ tables, construct a query to identify the names and email addresses of all customers who made gift certificate purchases; also include an overall order total for such purchases:
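One possible form of the query follows. The name, email, and order-amount columns (first_name, last_name, email_address, total_paid) are assumptions drawn from a typical Retail demo schema, not confirmed by this lesson; substitute the actual column names from your tables:

```sql
-- Sketch: join the HAWQ-managed fact table to the PXF external
-- dimension table, filtering on gift certificate purchases.
SELECT c.first_name, c.last_name, c.email_address,
       SUM(o.total_paid) AS gift_certificate_total
  FROM orders_hawq o
  JOIN customers_dim_pxf c ON o.customer_id = c.customer_id
 WHERE o.payment_method_code = 'GiftCertificate'
 GROUP BY c.first_name, c.last_name, c.email_address;
```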

The SELECT statement above uses columns from the HAWQ orders_hawq and PXF external customers_dim_pxf tables to form the query. The orders_hawq.customer_id field is compared with the customers_dim_pxf.customer_id field to produce the orders associated with a specific customer where the orders_hawq.payment_method_code identifies GiftCertificate.

Summary

In this lesson, you created PXF external tables to access HDFS data and queried these tables. You also performed a query using this external data and the HAWQ internal fact tables created previously, executing business logic on both your managed and unmanaged data.

This lesson wraps up the Getting Started with HAWQ tutorial. Now that you are familiar with basic environment set-up, cluster, database, and data management activities, you should feel more confident interacting with your HAWQ cluster.