Writing a MapReduce Job with the BigQuery Connector

GsonBigQueryInputFormat class

BigQueryInputFormat has been renamed GsonBigQueryInputFormat to better reflect its nature as a Gson-based format.

GsonBigQueryInputFormat gives Hadoop access to the appropriate BigQuery records in JsonObject format via the following primary operations:

Using a user-specified query to select the appropriate
BigQuery objects

Splitting the results of the query evenly among the Hadoop
nodes

Parsing the splits into Java objects to pass to the Mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.

This class provides access to BigQuery records through an
extension of the Hadoop
InputFormat class. To use this class
correctly, a few lines must be added to the main Hadoop job. In
particular, several parameters must be set in the Hadoop
configuration and the
InputFormat
class must be set to GsonBigQueryInputFormat. Below is
an example of the parameters to set and the lines of
code needed to correctly use GsonBigQueryInputFormat.
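A minimal sketch of that setup is shown below. It assumes the connector's BigQueryConfiguration helper class, the public publicdata:samples.shakespeare table, and a placeholder project ID; exact class and constant names can vary across connector versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat;

public class WordCountSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The BigQuery projectId under which the input operations occur.
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-first-cloud-project");

    // The fully qualified table to read from (QualifiedInputTableId).
    BigQueryConfiguration.configureBigQueryInput(
        conf, "publicdata:samples.shakespeare");

    Job job = Job.getInstance(conf, "wordcount");

    // Hand each BigQuery record to the Mapper as a
    // (LongWritable, JsonObject) pair.
    job.setInputFormatClass(GsonBigQueryInputFormat.class);

    // ... set the Mapper, Reducer, and output classes here, then submit.
  }
}
```

Depending on the connector version, the job may also need to clean up the temporary export files that the connector writes to Cloud Storage after the job completes.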

Input Parameters

QualifiedInputTableId

The BigQuery table to read from, in the form:
optional-projectId:datasetId.tableId
Example: publicdata:samples.shakespeare

projectId

The BigQuery projectId under which all of the input operations occur.
Example: my-first-cloud-project
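These parameters can also be set directly as Hadoop configuration keys; continuing the conf object from the setup sketch above, and assuming the raw key names that the BigQueryConfiguration constants resolve to in recent connector versions:

```java
// Equivalent to the BigQueryConfiguration helper calls above. The raw key
// names are an assumption based on the connector's documented constants.
conf.set("mapred.bq.project.id", "my-first-cloud-project");
conf.set("mapred.bq.input.project.id", "publicdata");
conf.set("mapred.bq.input.dataset.id", "samples");
conf.set("mapred.bq.input.table.id", "shakespeare");
```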

Caution: Using this feature to perform a full-scan query within the connector is normally not cost-effective. The same result can usually be achieved at lower cost by using the bq command-line tool to run the query into a temporary table, then running the MapReduce against that temporary table, as in the sketch below.
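As a rough illustration (the dataset, table, and query here are placeholders), the temporary table can be materialized with bq and then used as the job's QualifiedInputTableId:

```
bq query \
  --destination_table=my_dataset.my_temp_table \
  --allow_large_results \
  'SELECT word, word_count FROM [publicdata:samples.shakespeare]'
```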

Mapper

The GsonBigQueryInputFormat class reads from BigQuery and passes the BigQuery records one at a time as input to the Hadoop Mapper function. Each input is a LongWritable and JsonObject pair: the LongWritable tracks the record number, and the JsonObject contains the JSON-formatted BigQuery record. A Mapper for a sample WordCount job is sketched below.
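This sketch assumes each input record has a word field, as in the publicdata:samples.shakespeare table, and emits a count of 1 per occurrence; the field name is an assumption to adapt to your table.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;

// Receives (record number, JSON record) pairs from GsonBigQueryInputFormat
// and emits (word, 1) pairs for the Reducer to sum.
public class WordCountMapper
    extends Mapper<LongWritable, JsonObject, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable recordNumber, JsonObject record, Context context)
      throws IOException, InterruptedException {
    // "word" is a column of the assumed input table.
    JsonElement wordElement = record.get("word");
    if (wordElement != null) {
      word.set(wordElement.getAsString().toLowerCase());
      context.write(word, ONE);
    }
  }
}
```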

IndirectBigQueryOutputFormat class

IndirectBigQueryOutputFormat provides Hadoop with the ability to write
JsonObject values directly into a BigQuery table. This class provides access
to BigQuery records through an extension of the Hadoop
OutputFormat
class. To use it correctly, several parameters must be set in the Hadoop
configuration, and the OutputFormat class must be set to
IndirectBigQueryOutputFormat. Below is an example of the
parameters to set and the lines of code needed to correctly use
IndirectBigQueryOutputFormat.
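A sketch of that setup is shown below. It assumes the BigQueryOutputConfiguration, BigQueryTableSchema, and BigQueryTableFieldSchema helpers found in recent connector versions; the destination table, schema, and Cloud Storage buffer path are placeholders.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat;
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration;
import com.google.cloud.hadoop.io.bigquery.output.BigQueryTableFieldSchema;
import com.google.cloud.hadoop.io.bigquery.output.BigQueryTableSchema;
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat;

public class OutputSetup {
  public static void configureOutput(Job job) throws Exception {
    Configuration conf = job.getConfiguration();

    // Schema of the JsonObject records the Reducer emits.
    BigQueryTableSchema outputSchema =
        new BigQueryTableSchema()
            .setFields(
                Arrays.asList(
                    new BigQueryTableFieldSchema().setName("Word").setType("STRING"),
                    new BigQueryTableFieldSchema().setName("Count").setType("INTEGER")));

    // QualifiedOutputTableId plus a Cloud Storage path used to buffer the
    // output before it is loaded into BigQuery.
    BigQueryOutputConfiguration.configure(
        conf,
        "my-first-cloud-project:test_output_dataset.wordcount_output",
        outputSchema,
        "gs://my-bucket/tmp/wordcount",
        BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
        TextOutputFormat.class);

    job.setOutputFormatClass(IndirectBigQueryOutputFormat.class);
  }
}
```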

IndirectBigQueryOutputFormat works by first buffering all the data into a temporary Cloud Storage location and then, on commitJob, copying all of the data from Cloud Storage into BigQuery in one operation. Its use is recommended for large jobs, since it requires only one BigQuery "load" job per Hadoop/Spark job, as compared to BigQueryOutputFormat, which performs one BigQuery job for each Hadoop/Spark task.

Output Parameters

projectId

The BigQuery projectId under which all of the output operations occur.
Example: my-first-cloud-project

QualifiedOutputTableId

The BigQuery table to write the final job results to, in the form optional-projectId:datasetId.tableId. The datasetId should already be present in your project. A dataset named outputDatasetId_hadoop_temporary will be created in BigQuery for temporary results; make sure that it does not conflict with an existing dataset.
Examples: test_output_dataset.wordcount_output, my-first-cloud-project:test_output_dataset.wordcount_output

Reducer

The IndirectBigQueryOutputFormat class writes to BigQuery. It takes a key and a JsonObject value as input and writes only the JsonObject value to BigQuery (the key is ignored). The JsonObject should contain a JSON-formatted BigQuery record. The Reducer should output a pair consisting of a key of any type (NullWritable is used in our sample WordCount job) and a JsonObject value. A Reducer for the sample WordCount job is sketched below.
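This sketch follows the conventions above: it sums the counts for each word and writes a JsonObject whose field names ("Word", "Count") match the assumed output schema.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.google.gson.JsonObject;

// Sums the per-word counts emitted by the Mapper and writes each result as a
// JSON record; IndirectBigQueryOutputFormat ignores the NullWritable key.
public class WordCountReducer
    extends Reducer<Text, LongWritable, NullWritable, JsonObject> {

  @Override
  public void reduce(Text word, Iterable<LongWritable> counts, Context context)
      throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable count : counts) {
      total += count.get();
    }
    JsonObject record = new JsonObject();
    record.addProperty("Word", word.toString());
    record.addProperty("Count", total);
    context.write(NullWritable.get(), record);
  }
}
```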