Category Archives: BigQuery

Google launched its BigQuery cloud service in May 2012 to support interactive analysis of massive datasets of up to billions of rows. Shortly after the launch, QlikTech, one of the market leaders in BI solutions and known for its unique associative architecture built on an in-memory column store, demonstrated a Qlikview Google BigQuery application that visualizes data using BigQuery as the backend. This post is about how Qlikview and Google BigQuery can be integrated to provide an easy-to-use data analytics application for business users who work with large datasets.

Qlikview and Google BigQuery

Qlikview offers two options, depending on the needs and the volume of the data:

Qlikview BigQuery Connector: this add-on is written in .NET, so it requires the Microsoft .NET 4 framework to be installed on your computer. It loads the data into the in-memory data model, and various view types (table, bar chart, etc.) can then be created on the fly to visualize the data or a subset of it.

Qlikview BigQuery Extension Object: with a huge volume of data, not all of it can be loaded into memory. The Qlikview BigQuery Extension Object provides a web-based solution built upon the Google JavaScript API. Users navigate using the extension object and fetch only the relevant portion of the data from BigQuery.

Preparing the dataset in Google BigQuery

Before we start working with the Qlikview BigQuery solutions, we need to create a dataset in Google BigQuery. We are going to use Apple market data downloaded from the finance.yahoo.com site in CSV format.

The next step is to upload the CSV file into the table. You need to use the Chrome browser, as file upload does not work in Internet Explorer as of the writing of this post. The schema that was used is {date:string, open:float, high:float, low:float, close:float, volume:integer, adjclose: float}, chosen just to demonstrate that Google BigQuery is capable of handling various data types.
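The compact name:type notation used for the schema maps naturally onto a list of fields, each with a name and a type. Here is a minimal Python sketch of that conversion, assuming nothing beyond the schema string from this post; the helper name parse_schema and the choice of upper-cased type names are mine, for illustration only.

```python
import json

def parse_schema(spec):
    """Turn 'name:type, name:type' into a list of {'name', 'type'} dicts."""
    fields = []
    for part in spec.strip("{} ").split(","):
        name, _, type_ = part.partition(":")
        if not name.strip() or not type_.strip():
            raise ValueError("malformed field: %r" % part)
        fields.append({"name": name.strip(), "type": type_.strip().upper()})
    return fields

# The schema string exactly as given in the post.
MARKETDATA_SCHEMA = ("{date:string, open:float, high:float, low:float, "
                     "close:float, volume:integer, adjclose: float}")

if __name__ == "__main__":
    print(json.dumps(parse_schema(MARKETDATA_SCHEMA), indent=2))
```

Note that the date column is declared as a plain string here, so any date arithmetic would have to happen after loading.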

Finally we can run a simple SQL query to validate that the data has been successfully uploaded:

So far so good: we have the data loaded into Google BigQuery.

Qlikview BigQuery Connector

We need to have Qlikview installed; in my test I used Qlikview Personal Edition, which can be downloaded for free from the QlikTech website. Then we need to download the Qlikview BigQuery Connector from Qlikview market.

Once the Qlikview BigQuery Connector is installed, it appears in the same way as any other connector (just like ODBC or OLE DB). Go to Edit Script and then choose BigQuery as the database:

Once we click on Connect, an authorization window pops up on the screen. Google BigQuery relies on OAuth 2.0, so we need an OAuth 2.0 client id and client secret. These can be created using the Google API console: select the ‘Installed application’ and ‘Other’ options.

In Qlikview we need to authenticate ourselves using the client id and client secret:

After authentication the next step is to define the Select statement that will be used to load the data from Google BigQuery into memory:

When we click the OK button, the data is fetched into the Qlikview in-memory data model (7,000+ rows in our case):

We can now start processing and visualizing the data within Qlikview. First we are going to create a table view by right-clicking and then selecting New Sheet Object:

Let us then define another visualization object, a Line Chart:

And then a bar chart, so we get the following dashboard presenting the data that was loaded from the Google BigQuery backend into the Qlikview in-memory column store:

Qlikview BigQuery Extension Object

As said before, not all the data will necessarily fit into memory. Qlikview is very strong at compressing data, but we are talking about massive datasets, aren’t we? That is what big data is all about. In this case the Qlikview BigQuery Extension Object comes to the rescue. We need to download it from Qlikview market and install it.

As it is a web-based solution using JavaScript (the Google JavaScript API), we need a Google client id and client secret for web applications; we can create them in the same way as described above for the Qlikview connector. The ‘javascript origins’ attribute needs to be set to http://qlikview.

Then we need to turn on WebView in Qlikview:

Now we are ready to create a new visualization object by right-clicking and selecting New Sheet Object:

Then we have to define the visualization type (Table in this case) and the SELECT statement to fetch the data. Please note that we limited the data to 100 rows using the ‘select date, open, close, high, low from apple.marketdata limit 100;’ SQL statement:

We can define various visualization objects, similarly to the BigQuery Connector scenario:

Qlikview on Mobile

QlikTech promotes a unified approach based on HTML5 for delivering BI solutions on different platforms, so there is no need for additional layers to support data analytics and visualization on mobile devices. Qlikview Server is capable of recognizing mobile browsers and supports touch-screen functionality.

Qlikview has gained significant popularity among BI tools; Gartner positioned QlikTech in the leaders zone of its Business Intelligence Platforms Magic Quadrant in 2012. It provides a highly interactive, easy-to-use graphical user interface for business users, and the technology partnership with Google to provide seamless integration with BigQuery can only strengthen its position further.

This time I write about Google BigQuery, a service that Google made publicly available in May 2012. It has been around for some time: a Google Research blog post talked about it in 2010, then Google announced a limited preview in November 2011, and eventually it went live this month.

The technology is based on Dremel, not MapReduce. The reason for having an alternative to MapReduce is described in the Dremel paper: “Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce … jobs, but at a fraction of the execution time. Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations”.

So what is BigQuery? As the Google BigQuery website puts it: “Google BigQuery is a web service that lets you do interactive analysis of massive datasets—up to billions of rows.”

Getting Started with BigQuery

In order to use BigQuery, you first need to sign up for it via the Google API console. Once that is done, you can start using the service. The easiest way to start is with the BigQuery Browser Tool.

BigQuery Browser Tool

When you first log in to the BigQuery Browser Tool, you see the following welcome message:

There is already a public dataset available, so you can have a quick look around and get a feel for the BigQuery Browser Tool. For example, here is the schema of the github_timeline table, a snapshot from the GitHub archive:

You can run a simple query using COMPOSE QUERY from the browser tool; the syntax is SQL-like:

So far so good… Let us now create our own tables. The dataset that I used comes from the World Bank Data Catalogue and contains GDP and population data for countries all over the world. The data is available in CSV format (as well as Excel and PDF).

As a first step, we need to create the dataset; a dataset is basically a collection of one or more tables in BigQuery. Click on the down-arrow icon next to the API project and select “Create new dataset”.

Then you need to create the table. Click on the down-arrow for the dataset (worldbank in our case) and select “Create new table”.

Then you need to define table parameters such as the name, the schema and the source file to be uploaded. Note: Internet Explorer 8 does not seem to support CSV file upload (the message “File upload is not currently supported in your browser.” appears for the File upload link). You are better off with Chrome, which supports CSV file upload.

When you upload the file, you need to specify the schema in the following format: county_code:string,ranking:integer,country_name:string,value:integer

There are advanced options available, too: for example, you can use tab-separated files instead of comma-separated ones, and you can define how many invalid rows are accepted, how many rows are skipped, etc.

During the upload, the data is validated against the specified schema. If the schema is violated, you will get error messages in the Job history (e.g. “Too many columns: expected 4 column(s) but got 5 column(s)”).
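To get a feeling for what this validation and the advanced options do, the same checks can be approximated locally before uploading. The following is only an illustrative sketch, not BigQuery's own validator: the function names and the delimiter, skip_rows and max_bad_rows parameters are hypothetical stand-ins for the options described above, and the sample rows are made-up placeholders rather than World Bank figures.

```python
import csv
import io

# Python casts used to check each declared column type.
CASTS = {"string": str, "integer": int, "float": float}

def split_schema(spec):
    """'a:string,b:integer' -> [('a', 'string'), ('b', 'integer')]"""
    return [tuple(part.strip() for part in field.split(":"))
            for field in spec.split(",")]

def validate_csv(text, schema_spec, delimiter=",", skip_rows=0, max_bad_rows=0):
    """Return (good_rows, errors); raise ValueError once errors exceed max_bad_rows."""
    schema = split_schema(schema_spec)
    good, errors = [], []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        if line_no <= skip_rows:
            continue  # e.g. a header line
        if len(row) != len(schema):
            kind = "many" if len(row) > len(schema) else "few"
            errors.append("line %d: Too %s columns: expected %d column(s) "
                          "but got %d column(s)"
                          % (line_no, kind, len(schema), len(row)))
        else:
            try:
                good.append([CASTS[type_](value)
                             for (_, type_), value in zip(schema, row)])
            except ValueError as exc:
                errors.append("line %d: %s" % (line_no, exc))
        if len(errors) > max_bad_rows:
            raise ValueError("too many invalid rows: " + "; ".join(errors))
    return good, errors

# Schema string exactly as given in the post.
WORLDBANK_SCHEMA = "county_code:string,ranking:integer,country_name:string,value:integer"

# Made-up placeholder rows (not World Bank figures); line 3 has an extra column.
SAMPLE = ("code,rank,name,value\n"
          "AAA,1,Country A,300\n"
          "BBB,2,Country B,200,extra\n"
          "CCC,3,Country C,100\n")

if __name__ == "__main__":
    rows, problems = validate_csv(SAMPLE, WORLDBANK_SCHEMA,
                                  skip_rows=1, max_bad_rows=1)
    print(rows)
    print(problems)
```

With max_bad_rows left at 0, the same sample raises an error on the first invalid row instead of skipping it.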

Once the upload has finished successfully, you are ready to execute queries on the data. You can use COMPOSE QUERY for that, as we have already described for the github_timeline table. To display the top 10 countries with the highest GDP values, you run the following query:

I used the BigQuery command line tool from a Windows 7 machine; the usage is the same on Linux, with the exception of where the credentials are stored on your local computer (~/.bigquery.v2.token and ~/.bigqueryrc on Linux, and %USERPROFILE%\.bigquery.v2.token and %USERPROFILE%\.bigqueryrc on Windows).
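If you want to check from a script whether these files are already in place, a small Python sketch along the following lines should work on both platforms. It assumes only the two file names mentioned above; the helper name bq_credential_paths is made up for this example, and os.path.expanduser("~") is used because it typically resolves to %USERPROFILE% on Windows and to the home directory on Linux.

```python
import os

def bq_credential_paths(home=None):
    """Return (token_path, rc_path) under the given or the current user's home."""
    home = home or os.path.expanduser("~")
    return (os.path.join(home, ".bigquery.v2.token"),
            os.path.join(home, ".bigqueryrc"))

if __name__ == "__main__":
    for path in bq_credential_paths():
        status = "exists" if os.path.exists(path) else "missing"
        print("%s (%s)" % (path, status))
```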

The first time you run it, it needs to be authenticated via OAuth2.

So the first time, you need to go to the given URL with your browser, allow access for the BigQuery command line tool, and copy and paste the generated verification code at the “Enter verification code” prompt. The credentials are then stored on your local machine, as mentioned above, and you do not need to allow access again (unless you want to reinitialize the entire access process).

So at the second attempt, the BigQuery shell starts flawlessly without authentication:

The BigQuery browser tool and command line tool will do in most cases. But hell, aren’t we even tougher guys, masters of the APIs? If so, Google BigQuery offers APIs and client libraries for us, too. These are available for Python, Java, .NET, PHP, Ruby, Objective-C, etc.

Here is a Python application that runs the same SELECT query that we used from the browser tool and the command line:
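A minimal sketch of such an application follows. It rests on several assumptions: it uses the present-day google-cloud-bigquery client library, which is not necessarily the library this post was originally written against; the table name worldbank.gdp is a hypothetical placeholder (only the worldbank dataset name comes from the post); and the query is my reconstruction of a top-10 query from the schema given earlier.

```python
def build_top_gdp_query(table="worldbank.gdp", limit=10):
    """Build a top-N GDP query. The default table name is a hypothetical placeholder."""
    return ("SELECT country_name, value FROM {t} "
            "ORDER BY value DESC LIMIT {n}").format(t=table, n=int(limit))

def run_query(client, sql):
    """Run sql on any client exposing query(sql).result(); return the rows as a list."""
    return list(client.query(sql).result())

def main():
    # Assumption: the google-cloud-bigquery package is installed and
    # default credentials are configured for your Google Cloud project.
    from google.cloud import bigquery
    client = bigquery.Client()
    for row in run_query(client, build_top_gdp_query()):
        print(row["country_name"], row["value"])
```

Calling main() runs the query against your own project. Keeping the client as a parameter of run_query makes it easy to try the logic with a stub object before pointing it at the real service.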