Comparing Middleware and Native BI on Hadoop

In theory, big data technologies like Hadoop should advance the value of business intelligence tools to new heights, but as anyone who has tried to integrate legacy BI tools with an unstructured data store can tell you, the pain of integration often isn’t worth the gain.

Legacy BI tools were built long before data lakes came on the scene. Most require dedicated BI servers that hold data extracted from relational tables into multidimensional cubes. Building this configuration was often very time consuming, which seemed acceptable in a data warehouse environment but is much harder to justify with a modern data lake. It is particularly inefficient to have your IT department run traditional extract, transform, and load (ETL) procedures on huge volumes of data, especially unstructured data, just to move it from a data lake to a separate BI server.

The Cost of Extracting Data from Your Data Lake

“Extracts” (i.e., subsets of data copied from the data source to a separate data silo) are at odds with the reasons most organizations adopt big data technologies. One major reason is to promote data and analytic agility with unstructured data, so that you don’t depend heavily on IT resources to model data up front or to move data across silos. You have more opportunities for data discovery and exploration if you can load data quickly and work out a formal schema afterwards. This capability is known as schema-on-read: the data is stored as-is, and structure is applied only when the data is read (contrast this with schema-on-write, in which you must define the schema before you can write the data to the store). The schemaless nature of big data repositories doesn’t mesh well with the structured data sources that BI tools require. While workarounds exist in the form of intermediate engines that enable SQL queries to access Hadoop data stores, they typically don’t provide the levels of user concurrency you need for a production environment. The usual fallback is to create extracts, which defeats the purpose of using Hadoop.
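The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, and the helper functions `write_with_schema` and `read_with_schema` are hypothetical and do not correspond to any particular Hadoop API.

```python
import json

# Hypothetical raw records, landed in the store exactly as they arrived.
RAW_LINES = [
    '{"user": "ana", "amount": "19.99", "country": "PT"}',
    '{"user": "bo", "amount": "5.00"}',
    '{"user": "cy", "amount": "12.50", "country": "US", "coupon": "SPRING"}',
]

# Schema-on-write: the schema is fixed first; records that do not match are rejected.
WRITE_SCHEMA = {"user", "amount", "country"}


def write_with_schema(lines):
    """Accept only records whose fields exactly match WRITE_SCHEMA."""
    accepted, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) == WRITE_SCHEMA:
            accepted.append(record)
        else:
            rejected.append(record)
    return accepted, rejected


# Schema-on-read: every record is kept; each query projects the fields it needs.
def read_with_schema(lines, fields, defaults=None):
    """Apply a per-query structure to raw records, filling gaps with defaults."""
    defaults = defaults or {}
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({f: record.get(f, defaults.get(f)) for f in fields})
    return rows


accepted, rejected = write_with_schema(RAW_LINES)
by_country = read_with_schema(RAW_LINES, ["country", "amount"], {"country": "unknown"})
total = sum(float(row["amount"]) for row in by_country)
```

In this sketch the schema-on-write path keeps only one of the three records, while the schema-on-read path keeps all three and lets each query decide which fields matter and how to handle missing ones.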

Another major reason for using big data technologies is to have a central data store that offers simultaneous access to many or all of your data sets. Rather than have disparate data sets reside in isolation in data silos, a big data environment like a data lake can help you to identify correlations between data and enrich data to find insights that otherwise would have been difficult to uncover. On a side note, such an environment is also advantageous for machine learning frameworks, which benefit from having more data available for the training process.

Solutions to Enable Analytics on Big Data Without Data Extraction

Many BI experts have long known that extracts are a bad idea in a big data world, and technologies have evolved to address the problem. Two recent solutions are to use middleware in the form of online analytical processing (OLAP) tools or to use a BI technology native to big data stores.

OLAP on big data is middleware that plugs into your big data stack and connects legacy BI tools with your data platform. This model has the advantage of eliminating the need to extract data to a separate dedicated server. Instead, predefined views are embedded into the same Hadoop environment as the source data. Those views are typically defined by IT, and that creation process tends to be very time consuming with many iterations, as most veterans of OLAP technologies would know. Machine learning is increasingly being used in some products to create OLAP cubes automatically based upon query patterns. In theory, the predefined cubes should improve over time as more queries are performed, where “improve” means that more of the overall set of possible queries are accelerated while preserving efficient use of disk space.
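As a rough illustration of building cubes from query patterns, the sketch below picks which aggregates to precompute by counting how often each grouping appears in a query log. It is a simple frequency heuristic standing in for the machine learning that the products use, and every name in it (`FACTS`, `QUERY_LOG`, `build_aggregate`, `recommend_aggregates`, the `budget` parameter) is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical fact rows: (region, product, sales).
FACTS = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
    ("west", "widget", 3),
]
DIMENSIONS = ("region", "product")


def build_aggregate(facts, group_by):
    """Precompute SUM(sales) grouped by the given dimensions."""
    positions = [DIMENSIONS.index(dim) for dim in group_by]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[pos] for pos in positions)
        totals[key] += row[2]
    return dict(totals)


def recommend_aggregates(query_log, budget):
    """Return the `budget` most frequently requested groupings in the log."""
    counts = Counter(tuple(sorted(grouping)) for grouping in query_log)
    return [grouping for grouping, _ in counts.most_common(budget)]


# Hypothetical log of the groupings users asked for.
QUERY_LOG = [
    ("region",),
    ("region",),
    ("product", "region"),
    ("region",),
    ("product",),
]

# `budget` caps how many aggregates are stored, trading coverage for disk space.
chosen = recommend_aggregates(QUERY_LOG, budget=1)
cubes = {grouping: build_aggregate(FACTS, grouping) for grouping in chosen}
```

Raising `budget` accelerates more of the possible queries at the cost of more stored aggregates, which is the trade-off between query coverage and disk space described above.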

There are several advantages to the OLAP approach. The cost and delay of building and populating extract servers are largely avoided. Users can also drill down in dashboards directly to the co-located source data. Performance is greatly improved because aggregate cubes are prepared and stored in advance. There can also be significant cost advantages because the cubes scale with the underlying data platform rather than requiring expensive special-purpose servers.

In the end, however, this solution suffers from many of the same disadvantages as legacy BI environments. Time-consuming, up-front modeling by the IT organization is still required. Administration costs are higher because each new cube requires IT intervention. For the technologies that build cubes automatically, the design problem is that they connect tools and platform at a “lowest common denominator” level, since they try to be agnostic on each end. This leads to suboptimal results, so some of these technologies fall short in terms of performance, especially at scale. And, significantly for many industries, the OLAP approach doesn’t support real-time streaming data, which is increasingly demanded in process-related industries. In short, OLAP on big data is, in many respects, a return to the world of extracts that many organizations hoped to avoid by moving BI processing to Hadoop.

The other recent innovation, which does away with extracts entirely, is BI native to big data platforms. In this scenario, legacy BI tools are replaced by a modern, integrated processing engine and a visualization interface that directly access the underlying data store. As with the middleware approach, native BI can leverage Hadoop SQL engines like Hive, Impala, and Drill, while providing extra value such as dashboard acceleration. There is no need for extracts and their associated administrative overhead. No copies are created, so users can drill down directly to detail data. Native BI takes advantage of the underlying Hadoop infrastructure for scalability, enabling both data stores and query volume to scale at low cost. Machine learning is incorporated to recommend cube-like accelerators based on user behavior, avoiding the manual, up-front effort by IT of defining cubes. The need for IT intervention to build extract models is all but eliminated. In many cases, existing BI tools can even run on top of the native visual analytics engine to preserve skills users have already developed.
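One way to picture dashboard acceleration without extracts is a query path that uses a precomputed accelerator when one covers the request and otherwise falls through to the detail rows in the same store. The sketch below is an assumption about how such routing could look, not a description of any product; `DETAIL`, `ACCELERATORS`, `scan_detail`, and `run_query` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical detail rows kept in the data store: (region, product, sales).
DETAIL = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
]
COLUMNS = ("region", "product", "sales")


def scan_detail(group_by):
    """Aggregate directly over the detail rows (slow path, always available)."""
    positions = [COLUMNS.index(col) for col in group_by]
    totals = defaultdict(int)
    for row in DETAIL:
        totals[tuple(row[pos] for pos in positions)] += row[2]
    return dict(totals)


# Accelerators are precomputed results keyed by grouping, held alongside the
# detail data rather than copied to a separate server.
ACCELERATORS = {("region",): scan_detail(("region",))}


def run_query(group_by):
    """Serve from an accelerator when one matches; otherwise scan the detail rows."""
    key = tuple(group_by)
    if key in ACCELERATORS:
        return "accelerator", ACCELERATORS[key]
    return "detail", scan_detail(key)


summary_path, summary = run_query(["region"])
drill_path, drill = run_query(["region", "product"])
```

Here the summary view is served from the accelerator, while the drill-down by region and product is answered from the detail rows, with no second copy of the data involved.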

The principal disadvantage of native visual analytics is the lack of user familiarity. While the visualizations produced by modern engines are often superior to those of legacy BI platforms, some users are content with what they have and give little thought to the potential upside. Some people may need to be introduced to the basic concepts of Hadoop, NoSQL, and other big data constructs in order to understand the new paradigm and its real advantages.