Extensions

README

Contents

json_fdw

This PostgreSQL extension implements a Foreign Data Wrapper (FDW) for JSON
files. The extension doesn't require any data to be loaded into the database,
and supports analytic queries against array types, nested fields, and
heterogeneous documents.

json_fdw currently only works with PostgreSQL 9.2, and uses YAJL to parse JSON
files. Future releases of this wrapper will use the JSON parser functions that
are to going to be introduced in the PostgreSQL 9.3 release.

Building

json_fdw depends on yajl-2.0 for parsing, and zlib-devel to read compressed
files. So we need to install these packages first:

Once you have yajl-2.0 and zlib installed on your machine, you are ready to build
json_fdw. For this, you need to include the pg_config directory path in your
make command. This path is typically the same as your PostgreSQL installation's
bin/ directory path. For example:

PATH=/usr/local/pgsql/bin/:$PATH make
sudo PATH=/usr/local/pgsql/bin/:$PATH make install

Note: In RedHat 5.X and CentOS 5.X you may need to edit the Makefile and change "-l:libyajl.so.2" to "-lyajl".

Usage

These two parameters can be set on a JSON foreign table object.

filename: The absolute path of a json file or a gzipped json file.

max_error_count: Maximum number of invalid json documents to skip before
erroring out. Defaults to 0.

As an example, we demonstrate querying a compressed JSON file from scratch
here. We note that the underlying file contains JSON documents separated by
newlines, and that no data needs to be loaded into the database. Let's now start
with downloading the file.

-- find all reviews a particular customer made on the Dune series in 1998
SELECT
customer_id, "review.rating", "product.id", "product.title"
FROM
customer_reviews
WHERE
customer_id ='A27T7HVDXA3K2A' AND
"product.title" LIKE '%Dune%' AND
"review.date" >= '1998-01-01' AND
"review.date" <= '1998-12-31';
-- do we have a correlation between a book's title's length and its review ratings?
SELECT
width_bucket(length("product.title"), 1, 50, 5) title_length_bucket,
round(avg("review.rating"), 2) AS review_average,
count(*)
FROM
customer_reviews
WHERE
"product.group" = 'Book'
GROUP BY
title_length_bucket
ORDER BY
title_length_bucket;

Table Schema Conventions

There are three things worth noting about table schemas. First, nested fields in
JSON documents are referenced using dot separators. For example, a field defined
as "review": { "rating" : 5 } in a JSON document is declared as "review.rating"
in the foreign table schema. The quotes around "review.rating" are necessary, as
identifiers that include dots aren't valid in Postgres otherwise.

Second, the foreign table schema is defined at read-time. If you have an
additional field that you'd like to query, such as "review.votes", you can
simply add the column name and start querying for data. You can even create
multiple table schemas for the same underlying file, and query through them.

Third, json_fdw assumes that underlying data can be heterogeneous. If you are
querying for a column, and this field doesn't exist in a document, or the
field's data type doesn't match the declared column type, json_fdw considers
that particular field to be null.

Querying Multiple Files

json_fdw borrows its semantics from file_fdw, and associates one foreign table
with one JSON file. If you'd like to query all your JSON files from one table,
you could take one of two approaches. You could either use PostgreSQL's basic
table partitioning feature, and manually create one child table per JSON file.

Alternatively, you could use CitusDB binaries, and "stage" data into a
distributed foreign table. With this approach, you can also have the database
automatically collect statistics about the underlying data, and apply query
optimizations such as partition pruning. For more info, please see our
documentation page at http://citusdata.com/docs/foreign-data
, or contact us at engage @ citusdata.com.

Limitations

json_fdw only supports files that consist of one JSON document per line. It
doesn't support objects that span multiple lines.

PostgreSQL limits column names to 63 characters by default. If you need column
names that are longer, you can increase the NAMEDATALEN constant in
src/include/pg_config_manual.h, compile, and reinstall.

Copyright

Copyright (c) 2013 Citus Data, Inc.

This module is free software; you can redistribute it and/or modify it under the
GNU GPL v3.0 License.

For all types of questions and comments about the wrapper, please contact us at
engage @ citusdata.com.