Visualization on Impala: Big, Real-Time, and Raw

What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.

The traditional Apache Hadoop approach — in which you store all your data in HDFS and do batch processing through MapReduce — works well for data geeks and data scientists, who can write MapReduce jobs and wait hours for them to run before asking the next question. But many businesses have never even heard of Hadoop, don’t employ a data scientist, and want their data questions answered in a second or two — not in hours.

We at Zoomdata, working with the Cloudera team, have figured out how to make Big Data simple, useful, and instantly accessible across an organization, with Cloudera Impala being a key element. Zoomdata is a next-generation user interface for data, and addresses streams of data as opposed to sets. Zoomdata performs continuous math across data streams in real-time to drive visualizations on touch, gestural, and legacy web interfaces. As new data points come in, it re-computes their values and turns them into visuals in milliseconds.
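Zoomdata's engine itself isn't public, but the core idea of "continuous math across data streams" is incremental aggregation: rather than re-scanning a data set, the engine keeps a small running state that is updated in constant time as each new point arrives, and the current value drives the visual. A minimal sketch of that idea (the class and field names here are illustrative, not Zoomdata's API):

```python
class RunningAverage:
    """Incrementally maintained mean: O(1) update per new data point,
    so the current value can be re-rendered the moment a point arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # current value to push to the visual

# Feed a (simulated) stream of points; each update yields a fresh value.
avg = RunningAverage()
for point in [10.0, 20.0, 30.0]:
    current = avg.update(point)  # → 10.0, then 15.0, then 20.0
```

The same pattern generalizes to counts, sums, min/max, and sketch-based approximations; the key property is that no raw data needs to be revisited when a new point lands.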


To handle historical data, Zoomdata re-streams the historical raw data through the same stream-processing engine, the same way you'd rewind a television show on your home DVR. The amount of data involved can grow rapidly, so the ability to crunch billions of rows of raw data in a couple of seconds is important, which is where Impala comes in.
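The "DVR rewind" amounts to pushing stored raw records back through the same per-point update path the live stream uses, so historical and real-time data produce identical visuals. A hedged sketch under that assumption (the `process` and `replay` names are illustrative, not Zoomdata's API):

```python
def process(point, state):
    """One per-point update, shared by the live and replayed paths."""
    state["count"] += 1
    state["total"] += point
    return state

def replay(historical_records, state):
    """'Rewind': stream stored raw rows through the live update path."""
    for point in historical_records:
        state = process(point, state)
    return state

# Backfill from history, then let live points continue on the same state.
state = {"count": 0, "total": 0.0}
state = replay([5.0, 15.0], state)  # historical records
state = process(25.0, state)        # a live point arriving afterward
```

Because both paths share one update function, there is no separate "batch" code path to keep consistent with the streaming one.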

With Impala on top of raw HDFS data, we can run flights of tiny queries, each doing a tiny fraction of the overall work. Zoomdata adds the ability to process the resulting stream of micro-result sets instead of processing the raw data. We call this approach "micro-aggregate delegation"; it lets users see results immediately, allowing for instantaneous analysis of arbitrarily large amounts of raw data. The approach also allows micro-aggregate streams from disparate Hadoop, NoSQL, and legacy sources to be joined while they are in flight, an approach we call the "Death Star Join" (more on that in a future blog post).
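One way to picture micro-aggregate delegation: split the requested range into many small slices, issue one tiny aggregate query per slice, and fold the stream of micro-result sets into the running answer. The sketch below simulates the per-slice Impala query with a stub, since the real call would go over a connector such as ODBC/JDBC; the function names, slice count, and fake results are all assumptions for illustration:

```python
from datetime import datetime, timedelta

def slice_range(start, end, n):
    """Split [start, end) into n equal time slices, one per tiny query."""
    step = (end - start) / n
    return [(start + i * step, start + (i + 1) * step) for i in range(n)]

def micro_query(lo, hi):
    """Stand-in for a tiny Impala aggregate over one slice, e.g.
    SELECT COUNT(*), SUM(amount) FROM events WHERE ts >= lo AND ts < hi.
    Faked here so the sketch stays self-contained."""
    return {"count": 100, "sum": 250.0}

def merge(partials):
    """Fold the stream of micro-result sets into one running aggregate;
    each partial can update the visual as soon as it arrives."""
    total = {"count": 0, "sum": 0.0}
    for p in partials:
        total["count"] += p["count"]
        total["sum"] += p["sum"]
    return total

start = datetime(2014, 1, 1)
slices = slice_range(start, start + timedelta(hours=24), 24)
result = merge(micro_query(lo, hi) for lo, hi in slices)
```

Counts and sums merge trivially; averages are carried as (sum, count) pairs and divided at display time. Because each partial result refines the picture on arrival, the user sees an answer immediately rather than waiting for the full scan.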

The demo below shows how this works, by visualizing a dataset of 1 billion raw records per day nearly instantaneously, with no pre-aggregation, no indexing, no database, no star schema, no pre-built reports, and no data movement — just billions of rows of raw data in HDFS with Impala and Zoomdata on top.

Doing that the old way would have taken months of setup, days of loading, and hours of runtime. Now it is possible in real time and against history, through any imaginable visualization, on any device.