Reading data from and writing data to BigQuery

This example reads data from BigQuery into Spark to perform a word count using SparkContext.newAPIHadoopRDD (see the Spark documentation for more information). It writes the data back to BigQuery using PairRDDFunctions.saveAsNewAPIHadoopDataset.

Before trying this example, either create a dataset named "wordcount_dataset" or change the output dataset in the code to an existing BigQuery dataset in your Google Cloud Platform project. Here is the bq command to create the dataset:

bq mk wordcount_dataset

Scala

Examine the code

IndirectBigQueryOutputFormat provides Hadoop with the ability to write JsonObject values directly into a BigQuery table. This class provides access to BigQuery records through an extension of the Hadoop OutputFormat class. To use it correctly, several parameters must be set in the Hadoop configuration, and the OutputFormat class must be set to IndirectBigQueryOutputFormat. Below is an example of the parameters to set and the lines of code needed to correctly use IndirectBigQueryOutputFormat.

IndirectBigQueryOutputFormat works by first buffering all the data into temporary files in Cloud Storage and then, on commitJob, copying all the data from Cloud Storage into BigQuery in one operation. Its use is recommended for large jobs, since it requires only one BigQuery "load" job per Hadoop/Spark job, whereas BigQueryOutputFormat performs one BigQuery job for each Hadoop/Spark task.
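
Here is a minimal sketch of that configuration, assuming a spark-shell session started with the connector jar (as shown in the next section), so that sc and the fs.gs.* cluster properties exist; the output table and temporary Cloud Storage path are illustrative:

import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// spark-shell provides the SparkContext as sc.
val conf = sc.hadoopConfiguration

// Dataproc sets the project and system bucket on the cluster.
val projectId = conf.get("fs.gs.project.id")
val bucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)

// Input: the public Shakespeare sample table.
BigQueryConfiguration.configureBigQueryInput(conf, "publicdata:samples.shakespeare")

// Output table, plus the temporary Cloud Storage path (illustrative) that
// IndirectBigQueryOutputFormat buffers into before the final load job.
val outputTableId = projectId + ":wordcount_dataset.wordcount_output"
val outputGcsPath = "gs://" + bucket + "/hadoop/tmp/bigquery/wordcountoutput"
BigQueryOutputConfiguration.configureWithAutoSchema(
    conf,
    outputTableId,
    outputGcsPath,
    BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
    classOf[TextOutputFormat[_, _]])

// Set the OutputFormat class so saveAsNewAPIHadoopDataset uses the
// indirect (buffered) write path described above.
conf.set(
    "mapreduce.job.outputformat.class",
    classOf[IndirectBigQueryOutputFormat[_, _]].getName)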

Run the code on your cluster

On the cluster detail page, select the VM Instances tab, then click the SSH selection that appears to the right of the name of your cluster's master node. A browser window opens at your home directory on the master node.

$ spark-shell --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
...
Using Scala version ...
Type in expressions to have them evaluated.
Type :help for more information.
...
Spark context available as sc.
...
SQL context available as sqlContext.
scala>

Copy the example Scala code into a wordcount.scala file on the master node, then run it with the :load wordcount.scala command to create the BigQuery wordcount_output table. The output listing displays 10 lines from the wordcount output.
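
For reference, here is a minimal sketch of the core logic such a wordcount.scala file might contain, assuming the Hadoop configuration set up in the example above (conf); the convertToTuple and convertToJson helper names are illustrative:

import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

// Illustrative helpers mapping between JsonObject records and (word, count) pairs.
def convertToTuple(record: JsonObject): (String, Long) =
  (record.get("word").getAsString.toLowerCase, record.get("word_count").getAsLong)

def convertToJson(pair: (String, Long)): JsonObject = {
  val json = new JsonObject()
  json.addProperty("word", pair._1)
  json.addProperty("word_count", pair._2)
  json
}

// Read the BigQuery table as (LongWritable, JsonObject) records.
val tableData = sc.newAPIHadoopRDD(
    conf,
    classOf[GsonBigQueryInputFormat],
    classOf[LongWritable],
    classOf[JsonObject]).cache()

// Sum the per-row counts for each word.
val wordCounts = tableData
  .map(entry => convertToTuple(entry._2))
  .reduceByKey(_ + _)

// Display 10 results.
wordCounts.take(10).foreach(println)

// Write back to BigQuery. IndirectBigQueryOutputFormat discards keys, so pass null.
wordCounts
  .map(pair => (null, convertToJson(pair)))
  .saveAsNewAPIHadoopDataset(conf)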

To preview the output table, open your project's BigQuery page, select the wordcount_output table, and then click Preview.
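
Alternatively, assuming the bq command-line tool is available (as used above to create the dataset), you can sample the table from a terminal:

bq head -n 10 wordcount_dataset.wordcount_output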

PySpark

Examine the code

Indirect BigQuery import: PySpark cannot use the BigQueryOutputFormat to directly load data into BigQuery because, unlike Scala Spark, it cannot create Java GSON JsonObjects inside a map. This example loads the data into Cloud Storage, then invokes a bq command to load the result into BigQuery. Alternatively, you can create an org.apache.spark.api.python.Converter to create the JsonObjects (see these examples for more information).

Temporary file cleanup: This example includes a best-effort cleanup of the temporary files created in Cloud Storage by the BigQueryInputFormat. If the job fails, you may need to manually remove any remaining temporary Cloud Storage files, BigQuery datasets, and BigQuery tables. Typically, you'll find the temporary BigQuery exports used by the InputFormat in gs://bucket/hadoop/tmp/bigquery/, and temporary datasets named after your specified output dataset with a hadoop_temporary_job_[jobid] suffix.
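
Here is a minimal sketch of that stage-then-load approach, assuming the connector's JsonTextBigQueryInputFormat and a Dataproc cluster (so the fs.gs.* properties are set); the staging paths and wordcount_output table name are illustrative, wordcount_dataset must already exist, and the best-effort cleanup described above is omitted for brevity:

import json
import subprocess
import pyspark

sc = pyspark.SparkContext()

# Dataproc sets these cluster-wide; reuse them for staging paths.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

# Input configuration for the BigQuery connector (public Shakespeare sample).
conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Each record arrives as (row number, JSON string).
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

word_counts = (
    table_data
    .map(lambda record: json.loads(record[1]))
    .map(lambda row: (row['word'].lower(), int(row['word_count'])))
    .reduceByKey(lambda a, b: a + b))

# Stage the results in Cloud Storage as newline-delimited JSON.
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)
(word_counts
 .map(lambda pair: json.dumps({'word': pair[0], 'word_count': pair[1]}))
 .saveAsTextFile(output_directory))

# Invoke the bq CLI to load the staged files into BigQuery.
subprocess.check_call(
    'bq load --source_format NEWLINE_DELIMITED_JSON --autodetect '
    'wordcount_dataset.wordcount_output {}/part-*'.format(
        output_directory).split())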

Run the code on your cluster

Use Cloud Dataproc to submit the PySpark code: Instead of running the PySpark code manually from your cluster's master instance as explained below, you can submit the PySpark file directly to your cluster using the Google Cloud Platform console, the gcloud command-line tool, or the Cloud Dataproc REST API (see the Cloud Dataproc Quickstarts). Here are the steps using the gcloud command-line tool:

Create wordcount.py locally in a text editor by copying the PySpark code from the PySpark code listing.

Run the PySpark code by submitting the job to your cluster with the gcloud dataproc jobs submit command:
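
For example, assuming a cluster named cluster-name (substitute your own cluster's name):

gcloud dataproc jobs submit pyspark wordcount.py --cluster=cluster-name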

Alternatively, to run the code manually from the master instance: on the cluster detail page, select the VM Instances tab, then click the SSH selection that appears to the right of the name of your cluster's master node. A browser window opens at your home directory on the master node.
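
From there, create wordcount.py as above and run it with spark-submit; since the code in the sketch above creates its own SparkContext, no additional arguments are needed:

spark-submit wordcount.py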