IBM Finds the Need for (SQL on Hadoop) Speed

Alex Woodie

IBM will be joining Cloudera, Hortonworks, and others in the great SQL-on-Hadoop performance race when it ships Big SQL version 3 next month. In addition to peddling unadulterated speed, IBM will be touting security, data federation, and the capability for SQL-based BI tools, like Cognos, to get full access to Hadoop.

While traditional Hadoop version 1 engines like MapReduce have demonstrated the potential of Hadoop, enterprises today demand a broader set of interfaces into Hadoop that can be used with existing business intelligence tools and programmed by mere business analysts, as opposed to elusive (and expensive) data scientists. Most notably, organizations have demanded SQL, which effectively allows Hadoop to mirror the role of a traditional data warehouse, and to operate on structured data.

Hadoop distributors have responded to this demand by giving people what they want. IBM introduced Big SQL a year ago with the launch of InfoSphere BigInsights version 2.1. Before then, Cloudera began its Impala project, while Hortonworks sought to bolster Hive through the Stinger initiative.

Last week, IBM announced the latest incarnation of its Hadoop distribution, InfoSphere BigInsights version 3.0, which also includes Big SQL version 3.0. The new release of the Big SQL engine will bring “full function” SQL capabilities, and will allow users to run SQL on Hadoop in the same way they would for a traditional relational database, without requiring any changes to their apps, IBM says.

Specifically, IBM says Big SQL version 3 brings support for the SQL 2011 language, including support for stored procedures and user-defined functions. This brings Big SQL’s capabilities up to parity with what users expect of a data warehouse, says IBM distinguished engineer Linton Ward, who works with big data analytics in his role as the chief engineer for Power for workload optimized systems.

“What that means is you can now use these tools, like Cognos, that leverage SQL, to access Hadoop data,” Ward tells Datanami. “So Hadoop will still own the data, but it allows you to get SQL interfaces.”

It’s all about enabling a broader group of people access to the new Hadoop repositories. “Are statisticians the best people to be writing Java code?” Ward asks. “Maybe some of the [big data] tooling will be aided by some of the conventional SQL tools out there that have been developed over the last couple of decades.”

Data federation in Big SQL version 3 will enable users to submit SQL statements that tap into other data sources. Big SQL will automatically create the wrappers that submit the SQL query to (and pull data back from) DB2 for LUW, Oracle, and Teradata. This data federation feature will also support IBM’s data warehouse products, PureData System for Analytics and PureData System for Operational Analytics.

Big SQL will also see security enhancements. Specifically, IBM has broadened the ways that authentication can be performed, and now supports authentication based on the OS, on LDAP, or on custom authentication plug-ins. The fine-grained security policy in Big SQL prevents users from seeing rows and columns of data they don’t have permission to see, IBM says. All user activity can also be tracked and audited, while support for TLS ensures data is encrypted as it moves over the network.

IBM isn’t talking a lot at this point about the performance of Big SQL 3.0. In its announcement letter, IBM says the new interface will include “scale-out parallelism performance to hundreds of nodes.” It also talks about “extreme performance,” which doesn’t mean much.

IBM is expected to have new performance benchmark results to talk about when the product ships, which should be in June. (IBM, in its confounding way, essentially says it will release general availability information when the product is generally available.) “I think you’re going to see some pretty exciting performance coming out of that from the software team,” Ward says.

We’re in the midst of an SQL-on-Hadoop arms race, as vendors seek to differentiate their Hadoop offerings by building the best and highest performing SQL-on-Hadoop interfaces. There’s also an element of marketing and one-upmanship involved, which IBM, it appears, will be unable to resist.

In January, Cloudera touted internal benchmarks that showed its Impala SQL-on-Hadoop engine ran twice as fast as an unnamed commercial data warehouse system and 24 times faster than Apache Hive version 0.12. It also claimed that Impala scaled nearly linearly, at least up to 36 nodes. The company said it was working on another Impala test that scaled up to 1,000 nodes.

Last month, Hortonworks announced that SQL processing in the new Tez-based version of Hive, version 0.13, ran 100 times faster than Hive version 0.10 did when the Stinger initiative started 13 months ago. (The performance benefits versus Hive version 0.12 were not as great.)

As we get closer to Hadoop Summit–the presumed venue for the big unveil of Big SQL 3.0 and InfoSphere BigInsights 3.0–we’ll revisit the latest SQL performance claims.