Why Google BigQuery excels at BI on big data concurrency

Should you use Hadoop for your big data business intelligence needs? BigQuery? What's the difference between on-premises Hadoop, in-cloud Hadoop and a serverless model like Google BigQuery's? A new benchmark from AtScale seeks to help you navigate those questions.

If you're trying to do business intelligence (BI) on big data and the capability to handle a large number of concurrent queries is a key issue for you, Google BigQuery may well be the way to go, according to a new Business Intelligence Benchmark released Thursday by AtScale, a startup specializing in helping organizations enable BI on big data.

"Concurrency, I think, was the biggest one," Klahr says. "But the user experience with BigQuery was also really nice. Maybe this isn't a surprise because Google has focused so much on consumer products over the years: Everything about using the product was really nice. The thing that actually took the longest was loading the data from our local network onto the cloud. Once we had the data there, the creation of the tables was really easy."

For its benchmark, AtScale used the same model it deployed last year for its benchmark tests of SQL-on-Hadoop engines on BI workloads. For that test, the idea was to help technology evaluators select the best SQL-on-Hadoop technology for their BI use cases. The goal was the same for the Google BigQuery benchmark.

"The AtScale benchmark provides enterprise leaders with useful comparisons they need to make BI work on big data," Doug Henschen, vice president and principal analyst at Constellation Research, said in a statement Thursday. "As the data grows more complex and diverse, these benchmark stats help enterprises understand leading big data query options and make better decisions critical to supporting BI infrastructure.

AtScale's testing team used the Star Schema Benchmark (SSB) data set, which is based on the widely used TPC-H data and modified to more accurately represent a typical BI-oriented data layout. The data set allowed the team to test queries across large tables: The lineorder table contains close to 6 billion rows and the large customer table contains over a billion rows.

For the Google BigQuery benchmark, AtScale looked at the same three key requirements it used to evaluate the SQL-on-Hadoop engines last year and their fitness to satisfy BI workloads:

Performs on big data. SQL-on-Hadoop engines must be able to consistently analyze billions or trillions of rows of data without generating errors and with response times on the order of tens or hundreds of seconds.

Fast on small data. The engine needs to deliver interactive performance on known query patterns and, as such, it is important that the SQL-on-Hadoop engine return results in no greater than a few seconds on small data sets (on the order of thousands or millions of rows).

Stable for many users. Enterprise BI user bases consist of hundreds or even thousands of data workers. The underlying SQL-on-Hadoop engine must perform reliably under highly concurrent analysis workloads.
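
To make the third criterion concrete, here is a minimal sketch of how a concurrency test might be structured. It is not AtScale's harness: `run_query` is a hypothetical stand-in that simply sleeps, and in a real test it would submit SQL to the engine under evaluation. The user counts and query counts are also invented for illustration.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id: int) -> int:
    """Hypothetical stand-in for submitting one BI query to an engine."""
    time.sleep(0.01)  # simulate engine work; replace with a real client call
    return query_id


def timed_call(query_id: int) -> float:
    """Return the wall-clock latency of one query in seconds."""
    start = time.perf_counter()
    run_query(query_id)
    return time.perf_counter() - start


def measure(concurrent_users: int, queries_per_user: int = 5) -> dict:
    """Run queries from N simulated users at once and summarize latency."""
    total = concurrent_users * queries_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = list(pool.map(timed_call, range(total)))
    return {
        "users": concurrent_users,
        "queries": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


if __name__ == "__main__":
    # Compare latency summaries as the number of simulated users rises.
    for users in (1, 5, 25):
        print(measure(users))
```

With a real engine behind `run_query`, a result that is "stable for many users" would show the median and maximum latency rising slowly, rather than sharply, as the user count grows.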

Last year, AtScale found that Apache Impala 2.3, Apache Spark 1.6 and Apache Hive 1.2 (the three SQL-on-Hadoop engines it benchmarked) all had unique strengths and weaknesses that made them better suited to some BI use cases and less suited to others. For instance, Hive was the slowest of the engines, making it poorly suited for interactive queries, but by far the most stable of the three engines, with the best consistency across multiple query types. Impala and Spark were both better suited to smaller data sets.

As Klahr notes, BigQuery offered the best support for concurrency. It also didn't require much in the way of tuning or system configuration to start using.

"BigQuery doesn't require you to do much tuning and doesn't allow you to do much tooling," he says. "Our experience with Hive and Impala and Spark SQL is that each of those engines requires maybe several days to several weeks to get your parameters right."

AtScale found that the BigQuery management console, query tools and documentation made it easy to use and supported rapid on-boarding. Additionally, the process of moving data to the Google cloud and loading it into BigQuery was simple and well-documented, though Klahr notes the process is certainly faster with cloud-native data than with on-premises data.

Performance-wise, BigQuery didn't have quite the zip that Impala and Spark SQL boast, but it was close, Klahr says.

"It is worth considering how much effort it takes to get the performance vs. how much it takes to get acceptable performance," Klahr says.

If there was one area where BigQuery lagged significantly behind the other options, it was in joins.

"It doesn't handle large joins very well," Klahr says. "[Google] really are actively promoting nested data structures where all of your data is in one table."

Matt Baird, CTO and co-founder of AtScale, believes the results of the recent benchmark show how much the big data market has matured, and that platform vendors like Google have become a viable option to add to an enterprise's mix.

"These results of this benchmark indicate a rapid evolution in the big data market," he said in a statement Thursday. "Such a pace can be daunting for enterprises as they are already dealing with a fair amount of complexity: should they use Hadoop? Should they use BigQuery? What's the difference between on-premise Hadoop, in-cloud Hadoop and a serverless model like Google's? That's why we started AtScale."

Copyright 2018 IDG Communications. ABN 14 001 592 650. All rights reserved. Reproduction in whole or in part in any form or medium without express written permission of IDG Communications is prohibited.