How To Offload Data Processing from Google BigQuery

BigQuery is to Google as Redshift is to Amazon: a Big Data warehouse in the cloud for the SQL-savvy. It’s based on Dremel, the interactive query technology that Google built to analyze data from massive services such as Search, YouTube, Gmail and Docs.

BigQuery is a columnar database with append-only tables, a design that makes interactive querying really fast: scanning 35 billion rows in tens of seconds, according to Google.
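To see why a columnar layout helps, here is a toy Python sketch (illustrative only, not BigQuery’s actual internals): a row store must touch every field of every record to answer an aggregate, while a column store scans just the one array the query needs.

```python
# Illustrative sketch: why columnar storage speeds up analytics.
# The data and layout here are invented for demonstration.

rows = [
    {"user": "a", "country": "US", "revenue": 10.0},
    {"user": "b", "country": "DE", "revenue": 7.5},
    {"user": "c", "country": "US", "revenue": 3.0},
]

# Row-oriented: SUM(revenue) must read entire records.
row_total = sum(r["revenue"] for r in rows)

# Column-oriented: the same table stored as one array per column.
columns = {
    "user": ["a", "b", "c"],
    "country": ["US", "DE", "US"],
    "revenue": [10.0, 7.5, 3.0],
}

# SUM(revenue) now touches a single contiguous array -- far less I/O
# when the table has billions of rows and dozens of columns.
col_total = sum(columns["revenue"])

print(row_total, col_total)  # both 20.5
```

At BigQuery’s scale the same principle applies on disk: a query pays only for the columns it reads.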

Now, do we need to throw away the yellow elephant called Hadoop in favor of BigQuery?

Although BigQuery sounds like the end-all solution for processing Big Data, it’s not right for everything. Yes, it’s good for real-time querying, but Google itself recommends using Hadoop’s MapReduce rather than BigQuery in three main cases:

- Complex data mining
- Large joins between tables
- Exporting large amounts of processed data

So, if you need to do any of the above, you’ll be better off offloading the work to Hadoop. Better yet, you can offload it to Xplenty.
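The kind of batch job Hadoop excels at follows the MapReduce model. As a minimal in-memory sketch (the programming model only, not Hadoop itself), here is the classic word count in Python:

```python
from collections import defaultdict

# A minimal in-memory sketch of the MapReduce model that Hadoop
# runs at scale -- the classic word count.

def map_phase(docs):
    # Map: emit (word, 1) pairs from each document.
    for doc in docs:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big query", "data warehouse"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'query': 1, 'warehouse': 1}
```

Hadoop distributes the map and reduce phases across a cluster and streams intermediate data through a shuffle, which is what makes multi-pass mining and large joins practical on data too big for a single query.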

Xplenty is powered by Apache Hadoop with added BigQuery integration. You can take data from Google Cloud Storage and many other sources, mine it, join it, and transform it with Xplenty. Then you can store the results on Google BigQuery for your data warehouse needs. Here’s how.
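Conceptually, the offloaded flow looks like this toy Python sketch: pull raw records, join and aggregate them outside the warehouse, and keep only the compact result for loading into BigQuery. (Xplenty performs this on Hadoop; the CSV data here is invented for illustration.)

```python
import csv
import io

# Toy ETL sketch of the offloaded flow. In practice the raw data
# would come from Google Cloud Storage or another source, and the
# final dict would be loaded into a BigQuery table.

orders_csv = "order_id,user_id,amount\n1,u1,20\n2,u2,5\n3,u1,15\n"
users_csv = "user_id,country\nu1,US\nu2,DE\n"

orders = list(csv.DictReader(io.StringIO(orders_csv)))
users = {u["user_id"]: u["country"]
         for u in csv.DictReader(io.StringIO(users_csv))}

# Join orders to users, then aggregate revenue per country.
revenue = {}
for o in orders:
    country = users[o["user_id"]]
    revenue[country] = revenue.get(country, 0) + float(o["amount"])

print(revenue)  # {'US': 35.0, 'DE': 5.0}
```

The heavy join and aggregation happen in the batch layer; BigQuery then stores and serves the small, query-ready result.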

Google BigQuery Settings

Project Access

To run load jobs, you need to grant read permissions to Xplenty on the project that contains the relevant BigQuery dataset.

Xplenty Settings

Configuring the Google BigQuery Connection

Click the new connection button and select Google BigQuery under Analytical Databases.

Enter a name, the project ID that you copied from BigQuery, and the relevant dataset ID.

Click the test connection button to make sure that the connection works. If it fails, either one of the details was entered incorrectly or Xplenty hasn’t been given permission to connect to your BigQuery dataset.

Click Create Google BigQuery connection.

Output Processed Data to Google BigQuery

Open or create a new Xplenty package under My Packages.

Add the Google BigQuery destination component by clicking the plus button under the relevant component.

Click the component to edit it.

Select the relevant Google BigQuery connection and enter the target table. You can also auto-fill the fields with the green button on the right, or check the connection with the green button on the left.

Click OK, and don’t forget to save the package before running it.

Summary

Google BigQuery looks great for real-time querying of huge datasets. For optimal performance, though, you should offload batch processing to Apache Hadoop, especially complex data mining, large joins, and exporting large amounts of processed data. Xplenty, which is powered by Hadoop, can easily run these data processes and write the results back to Google BigQuery for big querying.