Powering BI with ODBC Connectors for CDAP

May 12, 2016

Bhooshan Mogal is a Software Engineer at Cask, where he is working on making data application development fun and simple. Before Cask, he worked on a unified storage abstraction for Hadoop at Pivotal and personalization systems at Yahoo.

Open Database Connectivity (ODBC) is the de facto standard API for accessing data stored in relational databases. ODBC drivers allow applications across a variety of platforms (especially non-Java ones) to access relational databases in a manner independent of the database implementation and the operating system.

In this blog post we will use a simple use case to discuss the integration between CDAP Datasets and Tableau through the CDAP ODBC driver. Datasets are a core abstraction within the Cask Data Application Platform (CDAP) for organizing, storing, and accessing data from multiple storage engines in a uniform manner. Instead of forcing users to manipulate data with low-level APIs, datasets provide higher-level abstractions and generic, reusable implementations of common data patterns. Some of the datasets that CDAP provides out of the box are Time Partitioned Filesets, Cube, and TimeSeries. Another motivation behind datasets is to make the same data accessible (for both reads and writes) across multiple processing paradigms, both real-time and batch, such as CDAP Flows, MapReduce, Spark, and others.

In addition, CDAP allows developers, data scientists, and business analysts familiar with SQL to explore datasets. Since the platform supports SQL, users can also use the CDAP JDBC driver in their Java applications to programmatically access and manipulate this data. We recently added ODBC support to CDAP, giving a wider variety of applications that support ODBC drivers seamless access to CDAP datasets. Let’s see it in action.

The following example shows a typical Cask Hydrator pipeline used to ingest customer data into a CDAP Table dataset. The pipeline reads a stream of events containing comma-separated customer information from a CDAP Stream. It then parses the data to extract fields and loads them into a table dataset “customers_ingest”.
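To make the parsing step concrete, here is a minimal Python sketch of what splitting one comma-separated event into named fields might look like. It is not the Hydrator pipeline's actual implementation. The field names and their order (`name`, `address`, `city`, `state`, `zip`) and the function `parse_customer_event` are assumptions made for illustration, since the post does not list the stream's schema.

```python
import csv
from io import StringIO

# Hypothetical field order; the real stream schema is not given in this post.
FIELDS = ["name", "address", "city", "state", "zip"]


def parse_customer_event(body: str) -> dict:
    """Split one comma-separated event body into a field-name -> value record."""
    # csv.reader handles quoted values that themselves contain commas.
    row = next(csv.reader(StringIO(body), skipinitialspace=True), [])
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    return dict(zip(FIELDS, (value.strip() for value in row)))


if __name__ == "__main__":
    # Made-up sample event, for illustration only.
    record = parse_customer_event("Jane Doe, 1 Main St, Palo Alto, CA, 94301")
    print(record["state"])
```

Each resulting record corresponds to one row that the pipeline would write to the table dataset. Rejecting rows with the wrong number of fields is one possible design choice; a real pipeline might route such rows to an error dataset instead.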

Now that the data has been ingested into the “customers_ingest” dataset, users can explore the data using CDAP Explore as shown below:

In a typical data-driven organization, such customer data could be used in various ways to derive insights about customers, their usage patterns, purchase history, and the like. Since this data also contains location information such as address and zip code, one use case is to plot it on a map using Tableau. Let’s see how users can easily create a map that depicts the density of users in our dataset by state.

The first step is to install the CDAP ODBC driver following these instructions. Once the driver is installed, users can connect to CDAP from Tableau and select the “dataset_customers_ingest” table. Once that table is connected, users can explore its data as shown below:

Now let’s plot this data on a map. Once the “dataset_customers_ingest” table has been selected as a data source, we select “State” as a dimension on the left and add a map widget to the Tableau sheet, and the data instantly appears on the map. The map shows the density of customers by state, using the default “SUM” measure.
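For readers who think in SQL, the per-state density on the map is roughly what a grouped count over the table would return. The sketch below illustrates that aggregation in Python, using an in-memory SQLite database purely as a stand-in for the ODBC connection. The table name comes from the example above, while the `name` and `state` columns, the sample rows, and the `customers_by_state` helper are assumptions for illustration. The exact query Tableau sends through the driver may differ.

```python
import sqlite3

# In-memory SQLite stands in for the CDAP ODBC connection in this sketch.
conn = sqlite3.connect(":memory:")

# Hypothetical two-column version of the table; the real dataset has more fields.
conn.execute("CREATE TABLE dataset_customers_ingest (name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO dataset_customers_ingest (name, state) VALUES (?, ?)",
    [("Jane Doe", "CA"), ("John Roe", "CA"), ("Ann Poe", "NY")],  # made-up rows
)

# Count customers per state, most populous first, ties broken alphabetically.
QUERY = """
    SELECT state, COUNT(*) AS customers
    FROM dataset_customers_ingest
    GROUP BY state
    ORDER BY customers DESC, state
"""


def customers_by_state(connection: sqlite3.Connection) -> list:
    """Return (state, customer_count) pairs for the map's density measure."""
    return connection.execute(QUERY).fetchall()


if __name__ == "__main__":
    for state, count in customers_by_state(conn):
        print(state, count)
```

The same grouped query could also be run from any ODBC-capable client once the driver is installed, assuming the underlying table exposes a state column.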

As you’ve seen, the CDAP ODBC driver allows users to perform powerful analytics on CDAP datasets with a few clicks by integrating with Tableau. Other capabilities of Tableau (and other BI tools that support ODBC) can also be similarly exercised on CDAP datasets using this integration. Please try out the CDAP ODBC driver and let us know your feedback.