PXF External Table and API Reference

You can extend PXF functionality and add new services and formats using the Java API without changing HAWQ. The API includes four classes: Fragmenter, Accessor, Resolver, and Analyzer. You must implement the Fragmenter, Accessor, and Resolver classes to add a new service; the Analyzer class is optional.

FRAGMENTER

The plugin (Java class) to use for fragmenting data. Used in READABLE external tables only.

ACCESSOR

The plugin (Java class) to use for accessing the data. Used in READABLE and WRITABLE tables.

RESOLVER

The plugin (Java class) to use for serializing and deserializing the data. Used in READABLE and WRITABLE tables.

Custom Options

Any additional options you want to add. These are passed at runtime to the plugins indicated above.

For more information about this example, see "About the Java Class Services and Formats" and the Pivotal Extension Framework Installation and Administration Guide.

About the Java Class Services and Formats

The Java class names you must include in the PXF URI are Fragmenter, Accessor, and Resolver. The Fragmenter class is mandatory for READABLE tables and is not supported for WRITABLE tables. Pivotal recommends that you reuse a previously defined Accessor or Resolver data format.

All the attributes are passed from HAWQ as headers to the PXF Java service. The Java service retrieves the source data and converts it to a HAWQ-readable format. You can pass any additional information to the user-implemented services.

The example in "Creating an External Table" shows the available keys and associated values. The example also contains attributes that are passed in from the HAWQ side. The available keys and associated values are as follows:

Attributes are available through the com.pivotal.pxf.api.utilities.InputData class. The following example shows how inputData.getProperty("USERINFO1") returns optional_info.

package com.pivotal.pxf.api.utilities;

/*
 * Common configuration of all MetaData classes.
 * Provides read-only access to common parameters supplied using system properties.
 */
public class InputData
{
    /*
     * Constructor of InputData.
     * Parses X-GP-* configuration variables.
     *
     * @param paramsMap contains all query-specific parameters from HAWQ
     */
    public InputData(Map<String, String> paramsMap);

    /*
     * Exposes the parameters map.
     */
    public Map<String, String> getParametersMap();

    /*
     * Copy constructor of InputData.
     * Used to create an instance from an extending class.
     */
    public InputData(InputData copy);

    /*
     * Returns a property as a string type.
     */
    public String getProperty(String property);

    /*
     * Returns the number of segments in GP.
     */
    public int totalSegments();

    /*
     * Returns the current segment ID.
     */
    public int segmentId();

    /*
     * Returns the current output format,
     * currently either text or gpdbwritable.
     */
    public OutputFormat outputFormat();

    /*
     * Returns the server name providing the service.
     */
    public String serverName();

    /*
     * Returns the server port providing the service.
     */
    public int serverPort();

    /*
     * Returns true if there is a filter string to parse.
     */
    public boolean hasFilter();

    /*
     * The filter string.
     */
    public String filterString();

    /*
     * Returns the number of columns in the Tuple Description.
     */
    public int columns();

    /*
     * Returns the column at the given index from the Tuple Description.
     */
    public ColumnDescriptor getColumn(int index);

    /*
     * Returns the fragment's serialized metadata.
     */
    public byte[] getFragmentMetadata();

    /*
     * Sets the fragment's serialized metadata.
     */
    public void setFragmentMetadata(byte[] location);

    /*
     * Returns the fragment user data.
     */
    public byte[] getFragmentUserData();

    /*
     * Returns a data fragment index.
     */
    public int getDataFragment();

    /*
     * Returns the column descriptor of the recordkey column.
     * If the recordkey column was not specified by the user in the CREATE TABLE
     * statement, getRecordkeyColumn returns null.
     */
    public ColumnDescriptor getRecordkeyColumn();

    /*
     * Returns the data source of the required resource (i.e., a file path or a table name).
     */
    public String dataSource();

    /*
     * Sets the data source of the required resource (i.e., a file path or a table name).
     */
    public void setDataSource(String dataSource);

    /*
     * Returns the path of the schema used for various deserializers,
     * e.g., an Avro file name or a Java object file name.
     */
    public String srlzSchemaName() throws FileNotFoundException, IllegalArgumentException;

    /*
     * Returns the class name of the Java class that handles the file access.
     */
    public String accessor();

    /*
     * Returns the class name of the Java class that handles the record deserialization.
     */
    public String resolver();

    /*
     * The avroSchema fetched by the AvroResolver, used in the case of an Avro file.
     * For Avro records inside a sequence file this variable is null
     * and the AvroResolver does not use it.
     */
    public Object getSchema();

    /*
     * The avroSchema is set from the outside by the AvroFileAccessor.
     */
    public void setSchema(Object schema);

    /*
     * Returns the compression codec (can be null, meaning no compression).
     */
    public String compressCodec();
}
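For illustration, the following is a minimal sketch of how a plugin constructor might read these parameters, including the custom USERINFO1 option described earlier. The MyDemoPlugin class name is hypothetical, and the sketch assumes the Plugin base class (which Fragmenter extends, as shown below) resides in com.pivotal.pxf.api.utilities.

package com.example.pxf;

import com.pivotal.pxf.api.utilities.InputData;
import com.pivotal.pxf.api.utilities.Plugin;

// Hypothetical plugin showing read-only access to InputData parameters.
public class MyDemoPlugin extends Plugin {
    private final String userInfo;

    public MyDemoPlugin(InputData input) {
        super(input);
        String source = input.dataSource();   // file path or table name
        int segments = input.totalSegments(); // number of HAWQ segments
        // Custom option passed through the external table definition;
        // per the example above, this returns "optional_info".
        userInfo = input.getProperty("USERINFO1");
    }
}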

Fragmenter

Note

The Fragmenter plugin is used only when reading data into HAWQ; such tables are called READABLE PXF tables. The Fragmenter cannot be used with WRITABLE tables, which write data out of HAWQ.

The Fragmenter is responsible for passing data source metadata back to HAWQ. It also returns a list of data fragments to the Accessor or Resolver. Each data fragment describes some part of the requested data set. It contains the data source name, such as the file or table name, and the hostname where it is located. For example, if the source is an HDFS file, the Fragmenter returns a list of data fragments, each containing an HDFS file block and the location of that block. If the source data is an HBase table, the Fragmenter returns information about table regions, including their locations.

com.pivotal.pxf.api.Fragmenter

package com.pivotal.pxf.api;

/*
 * Abstract class that defines the splitting of a data resource into fragments
 * that can be processed in parallel.
 * getFragments() returns the fragment information for a given path
 * (source name and location of each fragment).
 * Used to get fragments of data that can be read in parallel from the
 * different segments.
 */
public abstract class Fragmenter extends Plugin {
    protected List<Fragment> fragments;

    public Fragmenter(InputData metaData) {
        super(metaData);
        fragments = new LinkedList<Fragment>();
    }

    /*
     * The data source is a URI that can appear as a file name, a directory
     * name, or a wildcard.
     * Returns the data fragments.
     */
    public abstract List<Fragment> getFragments() throws Exception;
}

Class Description

getFragments() returns a list of fragments; PXF serializes this fragment information into JSON format and returns it to HAWQ. For example, if the input path is an HDFS directory, the source name for each fragment should include the file name and the path to the fragment.
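For illustration, a trivial custom Fragmenter might look like the following sketch. MyDemoFragmenter is a hypothetical name, and the Fragment constructor shown (a source name plus a host array) is an assumption; check the Fragment class in your PXF version for the exact signature.

package com.example.pxf;

import java.util.List;

import com.pivotal.pxf.api.Fragment;
import com.pivotal.pxf.api.Fragmenter;
import com.pivotal.pxf.api.utilities.InputData;

/*
 * Hypothetical Fragmenter that reports the whole data source as a single
 * fragment served from one host. A real implementation would enumerate
 * file blocks, table regions, or similar units of parallelism.
 */
public class MyDemoFragmenter extends Fragmenter {
    private final InputData input;

    public MyDemoFragmenter(InputData metaData) {
        super(metaData);
        this.input = metaData;
    }

    @Override
    public List<Fragment> getFragments() throws Exception {
        String source = input.dataSource();
        // Assumed constructor: Fragment(String sourceName, String[] hosts).
        fragments.add(new Fragment(source, new String[] { "localhost" }));
        return fragments;
    }
}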

Accessor

The Accessor retrieves specific fragments and passes records back to the Resolver. For example, the Accessor creates a FileInputFormat and a RecordReader for an HDFS file and sends this to the Resolver. In the case of HBase or Hive, the Accessor returns single rows from an HBase or Hive table. PXF 1.x or higher contains the following implementations:

Table: Accessor base classes

Accessor class: com.pivotal.pxf.plugins.hdfs.HdfsAtomicDataAccessor

Description: Base class for accessing data sources that cannot be split; these are accessed by a single HAWQ segment. Its implementations include:

QuotedLineBreakAccessor - Accessor for TEXT files whose records contain embedded line breaks.
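As a sketch of the read-side contract, the example below assumes the com.pivotal.pxf.api.ReadAccessor interface with openForRead(), readNextObject(), and closeForRead() methods, and a OneRow(key, data) constructor; MyDemoAccessor and its single-record behavior are hypothetical.

package com.example.pxf;

import com.pivotal.pxf.api.OneRow;
import com.pivotal.pxf.api.ReadAccessor;
import com.pivotal.pxf.api.utilities.InputData;
import com.pivotal.pxf.api.utilities.Plugin;

/*
 * Hypothetical accessor that returns exactly one record and then signals
 * end-of-data. A real accessor would open the underlying source (file
 * block, HBase region, and so on) and iterate over its records.
 */
public class MyDemoAccessor extends Plugin implements ReadAccessor {
    private boolean consumed = false;

    public MyDemoAccessor(InputData input) {
        super(input);
    }

    @Override
    public boolean openForRead() throws Exception {
        return true; // nothing to open in this sketch
    }

    @Override
    public OneRow readNextObject() throws Exception {
        if (consumed) {
            return null; // null signals no more records
        }
        consumed = true;
        // OneRow wraps a key and a value from the underlying source.
        return new OneRow(null, "hello from MyDemoAccessor");
    }

    @Override
    public void closeForRead() throws Exception {
        // nothing to release in this sketch
    }
}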

Resolver

The Resolver deserializes records in the OneRow format and serializes them to a list of OneField objects. PXF converts a OneField object to a HAWQ-readable GPDBWritable format. PXF 1.x or higher contains the following implementations:

Table: Resolver base classes

Resolver class: com.pivotal.pxf.plugins.hdfs.StringPassResolver

Description: Supports GPDBWritable VARCHAR. StringPassResolver replaced the deprecated TextResolver. It passes whole records (composed of any data types) as strings without parsing them.

Resolver class: com.pivotal.pxf.plugins.hdfs.WritableResolver

Description: Resolver for custom Hadoop Writable implementations. The custom class can be specified with the schema option and supports the following:
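Separately from the built-in resolvers above, a minimal read-side resolver sketch follows. It assumes the com.pivotal.pxf.api.ReadResolver interface with a getFields(OneRow) method, the OneField(type, value) constructor, and the com.pivotal.pxf.api.io.DataType enum; MyDemoResolver and its single-VARCHAR output are hypothetical.

package com.example.pxf;

import java.util.LinkedList;
import java.util.List;

import com.pivotal.pxf.api.OneField;
import com.pivotal.pxf.api.OneRow;
import com.pivotal.pxf.api.ReadResolver;
import com.pivotal.pxf.api.io.DataType;
import com.pivotal.pxf.api.utilities.InputData;
import com.pivotal.pxf.api.utilities.Plugin;

/*
 * Hypothetical resolver that turns each OneRow produced by the accessor
 * into a single VARCHAR field. A real resolver would map the row onto
 * the table's tuple description, one OneField per column.
 */
public class MyDemoResolver extends Plugin implements ReadResolver {
    public MyDemoResolver(InputData input) {
        super(input);
    }

    @Override
    public List<OneField> getFields(OneRow row) throws Exception {
        List<OneField> fields = new LinkedList<OneField>();
        // A OneField pairs a DataType OID with a value.
        fields.add(new OneField(DataType.VARCHAR.getOID(), row.getData().toString()));
        return fields;
    }
}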

About Query Filter Push-Down

If a query includes a number of WHERE clause filters, HAWQ may push all or some of them down to PXF. If pushed down, the Accessor can use the filtering information when it accesses the data source, fetching only the records that pass the filter evaluation conditions. This reduces data processing and network traffic.

This topic includes the following information:

Filter Availability and Ordering

Creating a Filter Builder Class

Filter Operations

Sample Implementation

Using Filters

Filter Availability and Ordering

PXF allows push-down filtering if the following rules are met:

Only single expressions or a group of AND'ed expressions - no OR'ed expressions.

Only expressions of supported data types and operators. See the Pivotal Extension Framework Installation and Administration Guide for more information.

FilterParser scans the pushed-down filter list and uses the user's build() implementation to build the filter.

For simple expressions (e.g., a >= 5), FilterParser places column objects on the left of the expression and constants on the right.

For compound expressions (e.g., <expression> AND <expression>), it handles three cases in the build() function:

Simple Expression: <Column Index> <Operation> <Constant>

Compound Expression: <Filter Object> AND <Filter Object>

Compound Expression: <List of Filter Objects> AND <Filter Object>

Creating a Filter Builder Class

To check whether a filter was passed to PXF with the query, call the InputData hasFilter() function:

/*
 * Returns true if there is a filter string to parse.
 */
public boolean hasFilter()
{
    return filterStringValid;
}

If hasFilter() returns false, there is no filter information. If it returns true, PXF parses the serialized filter string into a meaningful filter object to use later. To do so, create a filter builder class that implements the FilterParser.FilterBuilder interface:

/*
 * Interface a user of FilterParser should implement.
 * This is used to let the user build filter expressions in the manner she
 * sees fit.
 *
 * When an operator is parsed, this function is called to let the user decide
 * what to do with its operands.
 */
interface FilterBuilder {
    public Object build(Operation operation, Object left, Object right) throws Exception;
}

While PXF parses the serialized filter string from the incoming HAWQ query, it calls the build() function once for each condition or filter pushed down to PXF. Your implementation of this function returns a filter object or representation that the Accessor or Resolver uses at runtime to filter out records. The build() function accepts an Operation as input, along with left and right operands.

Filter Object

Filter objects can be internal, such as those you define, or external, such as those the remote system uses. For example, for HBase you use the HBase Filter class (org.apache.hadoop.hbase.filter.Filter), while for Hive you use an internal default representation created by the PXF framework, called BasicFilter. You can decide which filter object to use, including writing a new one. BasicFilter is the most common choice.

The following example creates a filter builder class that implements the FilterParser.FilterBuilder interface, implements the build() function, and generates the filter object. The Accessor, the Resolver, or both can then call its getFilterObject() function:
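A minimal sketch of such a class follows. It assumes FilterParser exposes an Operation type, ColumnIndex and Constant operand classes, and a nested BasicFilter class with an (Operation, ColumnIndex, Constant) constructor; MyDemoFilterBuilder and getFilterObject() are hypothetical names.

package com.example.pxf;

import java.util.LinkedList;
import java.util.List;

import com.pivotal.pxf.api.FilterParser;
import com.pivotal.pxf.api.utilities.InputData;

/*
 * Hypothetical filter builder. It turns the serialized filter string into
 * FilterParser.BasicFilter objects (single conditions) or a list of them
 * (AND'ed conditions), covering the three build() cases described above.
 */
public class MyDemoFilterBuilder implements FilterParser.FilterBuilder {
    private InputData inputData;

    public MyDemoFilterBuilder(InputData input) {
        inputData = input;
    }

    /*
     * Translates a filter string into a filter object
     * (a BasicFilter or a list of BasicFilters).
     */
    public Object getFilterObject(String filterString) throws Exception {
        FilterParser parser = new FilterParser(this);
        return parser.parse(filterString);
    }

    @Override
    @SuppressWarnings("unchecked")
    public Object build(FilterParser.Operation opId, Object leftOperand,
                        Object rightOperand) throws Exception {
        // Case 1 - simple expression: <column index> <operation> <constant>.
        if (leftOperand instanceof FilterParser.ColumnIndex) {
            // Assumed constructor: BasicFilter(Operation, ColumnIndex, Constant).
            return new FilterParser.BasicFilter(opId,
                    (FilterParser.ColumnIndex) leftOperand,
                    (FilterParser.Constant) rightOperand);
        }

        // Cases 2 and 3 - compound expressions joined by AND.
        List<FilterParser.BasicFilter> filters;
        if (leftOperand instanceof List) {
            // Case 3: <list of filter objects> AND <filter object>.
            filters = (List<FilterParser.BasicFilter>) leftOperand;
        } else {
            // Case 2: <filter object> AND <filter object>.
            filters = new LinkedList<FilterParser.BasicFilter>();
            filters.add((FilterParser.BasicFilter) leftOperand);
        }
        filters.add((FilterParser.BasicFilter) rightOperand);
        return filters;
    }
}

With such a builder, an Accessor can first check inputData.hasFilter(), then call getFilterObject(inputData.filterString()) and evaluate the resulting BasicFilter objects against candidate records.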

Reference

This section contains the following information:

External Table Examples

Plugin Examples

Configuration Files

External Table Examples

Example 1

Shows an external table that can analyze all SequenceFiles that are populated with Writable serialized records and exist inside the HDFS directory sales/2012/01. SaleItem.class is a Java class that implements the Writable interface and describes a Java record that includes three class members.

Note: In this example, the class member names do not necessarily match the database attribute names, but the types match. SaleItem.class must exist in the classpath of every DataNode.
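For illustration, SaleItem could be shaped like the following sketch; the three member names and types (itemId, itemName, price) are hypothetical, since the example does not list them.

package com.example.pxf;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/*
 * Hypothetical Writable record with three members. The Hadoop Writable
 * contract requires symmetric write/readFields implementations.
 */
public class SaleItem implements Writable {
    private int itemId;      // hypothetical member
    private String itemName; // hypothetical member
    private double price;    // hypothetical member

    public void write(DataOutput out) throws IOException {
        out.writeInt(itemId);
        out.writeUTF(itemName);
        out.writeDouble(price);
    }

    public void readFields(DataInput in) throws IOException {
        itemId = in.readInt();
        itemName = in.readUTF();
        price = in.readDouble();
    }
}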

Example 2

Shows an external table that can analyze an HBase table called sales. It has 10 column families (cf1 – cf10) and many qualifier names in each family. This example focuses on the rowkey, the qualifier saleid inside column family cf1, and the qualifier comments inside column family cf8 and uses Direct Mapping:

Example 3

This example uses Indirect Mapping. Note how the attribute names change and how they correspond to the HBase lookup table. When you execute a SELECT from my_hbase_sales, the attribute names are automatically converted to their HBase counterparts.