The hot new technology in Big Data is decades old: SQL

The decades-old database technology is staging a comeback.

A slide from a presentation at Hadoop Summit describes Hortonworks' Stinger initiative, an effort to make SQL work better with Hadoop through Apache Hive.

Jason Levitt

Despite the growth of "NoSQL" databases over the past few years, SQL isn't going anywhere. In fact, Structured Query Language seems to be in the ascendant in a realm that once seemed bent on excluding it: "Big Data."

At the recent Hadoop Summit, among all of the announcements of new products and partnerships around "big data" analytics, one of the surprising trends was the apparent resurgence of a technology that has been around for decades. Many of the announcements coming from companies at the Summit centered on using SQL as the primary interface for Big Data analytics.

"It looks as if there's not a Hadoop-related vendor here who isn't promoting an SQL solution," said Paco Nathan, former director of data science at Concurrent and now Chief Scientist at Mesosphere, a speaker at Hadoop Summit. “And a few of them sound too good to be true."

Built on Hadoop

Hadoop is the open source batch-processing storage and analysis engine based on research papers published by Google about its MapReduce and Google File System technology. It's the underlying technology behind many "big data" analysis tools being used to sift through huge volumes of information created by Web visits, server logs, and all other sorts of data streams. Facebook, for example, has over 30 petabytes of data in Hadoop clusters, and it created the Hive query front-end for Hadoop (which is now an Apache open-source project). The NSA's Accumulo database, used by the agency to do real-time analysis of intelligence data, is also built on top of Hadoop.

But Hadoop can be a challenging system to learn, since it requires that users understand both its problem-solving strategy, called MapReduce, and a programming language that supports MapReduce tasks. MapReduce uses batches of parallel processing jobs to sort through large volumes of data. SQL, on the other hand, is used with nearly every relational database system, and huge numbers of people already know how to use it effectively to mine and analyze data. While the Facebook-created Hive provides an SQL-like front end for Hadoop, it neither implements full SQL semantics nor is it particularly fast, since it simply translates queries into batch-processed MapReduce jobs for Hadoop.
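The contrast the article is drawing can be sketched in a few lines of Python: a word count written as explicit map and reduce steps, versus the same question asked as a single SQL query (SQLite standing in here for any SQL engine; the data and table name are invented for illustration).

```python
import sqlite3
from collections import defaultdict

lines = ["big data needs sql", "sql is not going anywhere", "big data"]

# MapReduce style: a map phase emits (key, 1) pairs,
# then a reduce phase sums the values for each key.
def map_phase(line):
    for word in line.split():
        yield (word, 1)

counts = defaultdict(int)
for line in lines:
    for word, n in map_phase(line):
        counts[word] += n  # the "reduce" step

# SQL style: load the same rows and let the engine do the grouping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [(w,) for line in lines for w in line.split()])
sql_counts = dict(conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word"))

assert counts == sql_counts  # both approaches yield the same result
```

The MapReduce version makes the execution strategy explicit; the SQL version states only the question and leaves the strategy to the engine, which is exactly why it is the friendlier interface for analysts.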

Over the past six months, vendors have responded to the demand for more corporate-friendly analytics by announcing a slew of systems that offer full SQL query capabilities with significant performance improvements over existing Hive/Hadoop systems. These systems are designed to allow full SQL queries over warehouse-size data sets, and in most cases they bypass Hadoop entirely (although some are hybrid approaches). Allowing much faster SQL queries at scale makes big data analytics accessible by many more people in the enterprise and fits in with existing workflows.

Here's a sampling of some of the SQL-for-Big Data initiatives underway:

Amazon Web Services' Redshift. The service provides an SQL-based data warehouse service that can handle queries against databases of up to 1.6 petabytes.

Hortonworks' Stinger initiative, an effort to improve the SQL interface of Hive and make Hive 100 times faster.

IBM's BigSQL, an SQL query engine for Hadoop. BigSQL bypasses MapReduce and runs against the Hadoop Distributed File System for read-only queries and HBase (the Hadoop database engine) for transactional queries that perform reads and writes of data.

EMC's HAWQ, an SQL query engine for the company's Pivotal HD version of Hadoop.

There are also upcoming changes to Hadoop itself that will make SQL queries of Hadoop data easier. Hadoop 2.0, which will be released later this year, replaces the MapReduce code in Hadoop with a modular architecture called YARN (Yet Another Resource Negotiator) that permits multiple analysis systems to co-exist with MapReduce.

37 Reader Comments

SQL is just the Structured Query Language. You can make an SQL implementation even on a non-ACID DBMS. We store smaller datasets in an RDBMS, while for bigger ones we use a NoSQL approach. We only import the most interesting "Big Data" subsets into an RDBMS.

SQL is actually a handy interface to Hadoop. MapReduce can be harder to think about and implement than a SQL query. Given its pervasiveness, it makes sense to implement some form of it in this realm as well.

As far as your list of "Big Data" SQL implementations, there is also Google's BigQuery, though it does not run on top of Hadoop.

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.

How would one go about learning hadoop in a practical hands on manner? Are there giant test data sets available like Microsoft's Adventure works db? I have a decent understanding of sql and microsofts rdbms but not any real world experience. I also have a subscription to safari books online if anyone has a suggestion for a book or a place to start. Is the OReilly book any good? Pm s welcome. .. can we pm here? Maybe I should post in the forums.

while Ive never built anything the size of facebook.. I haven't run into a situation where sql (more specifically rdbms tech) hasn't worked well...

im talking apps with hundreds of thousands of users and 10s of thousands of transactions per second

That sounds like an overly broad statement. I've seen a number of designs where an RDBMS (Oracle) was proposed as a datastore mainly because it already existed as a deployed technology (familiar to the staff) rather than because it was the best tool. Strict ACID compliance comes with a performance cost; at some point, concurrency on a shared resource is going to bite.

If you can put a SQL interface in front of these "noSQL" databases, suddenly they can be accessed by the many, many tools designed to use SQL and developers trained in SQL.

How would one go about learning hadoop in a practical hands on manner? Are there giant test data sets available like Microsoft's Adventure works db? I have a decent understanding of sql and microsofts rdbms but not any real world experience. I also have a subscription to safari books online if anyone has a suggestion for a book or a place to start. Is the OReilly book any good? Pm s welcome. .. can we pm here? Maybe I should post in the forums.

the oreilly book is pretty good and so is hadoop essentials and hadoop operations IMHO. the web is full of discussion. having some big hardware and a project to motivate you is nice too. hard for me to get much real learning done without a looming deadline, but many are different in that way I suppose

Also, Hortonworks has a pretty good sandbox (a premade VM) that has a bunch of tutorials. Several of my coworkers have been using it to familiarize themselves with Hadoop before I deploy our cluster.

Looking at RDBMS and "NoSQL" databases, I consider them complementary technologies. Each has data storage/handling problems where it excels, so it is a matter of understanding the tools and using the correct tool.

SQL is the language used by an RDBMS. Some standard constructs of SQL, such as joins, are often not supported by "NoSQL" databases. Could a subset/extension of SQL be used by a "NoSQL" database? Possibly. But I think focusing on the query language is the wrong approach. The first question should be what type of database you need.

while Ive never built anything the size of facebook.. I haven't run into a situation where sql (more specifically rdbms tech) hasn't worked well...

im talking apps with hundreds of thousands of users and 10s of thousands of transactions per second

I don't think anyone is suggesting using Hadoop for that kind of scenario. Where it excels is when you have many terabytes of data in a few tables that are write-once, read-many. Hadoop is fundamentally an offline, batch processing tool. Granted, there are many efforts to bring it into the realm of interactive-speed querying, as outlined in the article, but even those are not meant to serve thousands of concurrent connections.

And when you truly reach ridiculous scales (like Facebook), HBase allows you to scale to an absurd level, although you have to be able to design your app to not rely on the ACID features and SQL interface of a traditional RDBMS.

How would one go about learning hadoop in a practical hands on manner? Are there giant test data sets available like Microsoft's Adventure works db? I have a decent understanding of sql and microsofts rdbms but not any real world experience. I also have a subscription to safari books online if anyone has a suggestion for a book or a place to start. Is the OReilly book any good? Pm s welcome. .. can we pm here? Maybe I should post in the forums.

As far as datasets, the government publishes all sorts of things - census, economic, patents, etc. To be honest, though, you don't want a huge one to learn with, as the 'round trip' of coding, running, and seeing results will just take longer. Get a public data set and pare it down to a smaller size.

Thanks for all the suggestions everybody. I really like the idea of a premade VM.

Is there a general demarcation point for when something becomes "Big Data" and starts to require Hadoop vs. a traditional Oracle or Microsoft SQL DB offering?

edit* As I read that sentence, I am not sure I am asking the right question. I think what I meant was: are migrations from a traditional DB to Hadoop (or other big data solutions) a common occurrence? Is migration something a company would do when queries become slow and/or stop returning the expected results?

These days it seems that whenever anyone mentions SQL they're actually talking about relational databases. So let's talk about them. They're not going anywhere.

If you have a big data project, you're going to want to correlate it with other data at some point. You might not control that other data, but the two have a few columns which are relevantly similar. Congratulations: you now have a relational database. Now if only there were some friendly language with which to ask questions of such a database...
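The point about correlating datasets can be made concrete with a toy example (Python's built-in sqlite3 standing in for any SQL engine; the table and column names are invented for illustration): two tables whose only connection is a shared user_id column, correlated with one JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits   (user_id INTEGER, page TEXT);
    CREATE TABLE accounts (user_id INTEGER, plan TEXT);
    INSERT INTO visits   VALUES (1, '/home'), (1, '/pricing'), (2, '/home');
    INSERT INTO accounts VALUES (1, 'pro'), (2, 'free');
""")

# The shared user_id column is all the "relational" structure needed
# to ask a question neither table can answer alone.
rows = conn.execute("""
    SELECT a.plan, COUNT(*) AS page_views
    FROM visits v JOIN accounts a ON v.user_id = a.user_id
    GROUP BY a.plan
    ORDER BY a.plan
""").fetchall()

print(rows)  # [('free', 1), ('pro', 2)]
```

Neither dataset needs to know about the other in advance; the correlation is expressed entirely in the query, which is the "friendly language" in question.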

Promoters know how to promote. People with database experience know how to use SQL. Promoters and tech heads all get excited about 'big'. Few people have much idea of how to use any tools to effectively analyze data. But the reality is that any approach to analyzing structured data is basically going to rely on the kind of programming paradigm that SQL brought into the world. On the other hand, it is hard to see the value of a relational database as the persistent store for anything other than transactional data. It could also easily be that the small minority of participants who actually succeed in extracting useful information from large data sets end up using a variety of tools that don't necessarily cost a large amount of money. It is also the case that our national economic policy manages to ignore the information that is available in relatively small data sets like the non-farm payroll and the national income and product accounts.

As has been mentioned previously, SQL is just a language for data access. Any system can be adapted to leverage SQL.

The problem I see is this: SQL was designed (more or less) to act as a front end to relational databases (as described in The Relational Model for Database Management v2 - Codd). Using SQL to define/query DBMSs based on other models may be convenient, but it sacrifices functionality or usability.

Take graph databases, for example. They excel at what the major RDBMSs usually choke on (variable levels of subqueries and recursive queries). SQL can express these queries just fine, but you end up writing pages of query code or toiling with building your query dynamically. A more on-point query language would be more elegant (Cypher on the Neo4j graph database is a nice example). The code is short, expressive, and useful.

So are NoSQL vendors sacrificing elegance for a quick boost in market share?

I think the mention of the debate on NoSQL vs SQL in this article is out of place. That debate primarily applies to large-scale high-QPS transactional latency-sensitive systems, not data analytics. SQL always was and will be the preferred way to do ad hoc querying and analytics. The problem was that up until recently we didn't have good SQL engines that could process big data. Big data wasn't "bent on excluding [SQL]". It's simply an extremely hard thing to implement. Gradually, people are solving this problem, which is very well summarized in the rest of the article.

No love for Vertica (now HP Vertica) on the list? (It's a columnar DB.)

I think that's the direction big data is going in. Not only because it runs SQL-99 plus a host of SQL-based analytic functions, but because you can create functions in a language like R. The database engine will split the R work across N nodes. I think for a lot of big data folks, languages like R really bridge the gaps of traditional SQL.

"Spark was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can run up to 100x faster than Hadoop MapReduce."

"Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries."

Although Hadoop has a big head start, Spark has a decisive speed advantage.

Another conspicuous absence: Aster Data nCluster (now "Teradata Aster Database" after they were acquired by Teradata).

Aster is an MPP cluster database, built on a custom Postgres implementation. Like most of the ones mentioned above, processing and data are distributed across multiple commodity servers. A cluster can have hundreds or even thousands of servers, each running a number of "virtual workers." Large tables get chunked up across all vWorkers via a hash algorithm, and smaller tables get replicated on all of them to reduce load on the intra-cluster fabric.

The really cool thing about Aster nCluster, though, is that it has a way of uploading map-reduce functions into the database (they can be written in Java or C++). The functions (called "SQL MapReduce" functions or "SQL-MR" functions for short) can be passed rows from SQL queries and return rows to the database/client, taking advantage not only of the Map-Reduce programming paradigm but the distributed "cluster" database as well. Pretty nifty.

Curiously enough, Aster nCluster doesn't have a built-in stored procedure language (i.e. no PL/SQL, Transact-SQL, etc). I recently founded a startup to create one, and we're in Beta test right now. It's an interpreted language that runs in (and as) a SQL-MR function. That allows you to write code that takes advantage of the MapReduce paradigm right in the text editor of your favorite query tool (assuming it's compatible with Aster nCluster, obviously).

The language itself is fairly simple, by design. There are a handful of "container" types (arrays, maps, etc), and your basic branching and looping constructs. Security was a concern, since the language runs on virtual workers INSIDE the database cluster. For that reason, I didn't simply adapt an existing scripting language. It was too risky to expose every possible object via reflection and/or encapsulation. So instead, I carefully chose a subset of features that would be useful (JDBC capability, XML parsing, some hooks into the Aster nCluster API, etc.) and went with those.

It's not meant to replace Aster's built-in SQL-MR feature, but it does simplify using it. Code written in our language is more compact (by virtue of having all the API infrastructure already built in), and you don't have to write your SQL-MR function in Eclipse, upload it to the database, test, debug, go back to Eclipse, lather, rinse, repeat. Our language can be maintained as a file in the database like a regular stored procedure, so in effect, we have created a stored procedure language for a DBMS that doesn't support them. And it's a distributed, massively parallel language to boot.

In all honesty, I've never been a fan of SQL. Don't get me wrong -- it's as intuitive as I think it can be. But I would like to see something come out that lets me use even plainer English than "SELECT * FROM blah WHERE blah.foo = 'bar'" etc etc etc.

Sure -- single table SELECT statements are simple and friendly. But any time I have to look at a query with even the simplest of joins, my eyes glaze over and I instantly feel a headache coming on. I wish I could just write something as human as "Get these columns from this table and those columns from that table where the value in this column equals the value in that column". It might be a bit more verbose, but geez it makes it much easier on us programmer types that are too stupid to understand complex queries with lots of joins.
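For what it's worth, the explicit JOIN ... ON syntax already reads close to that English sentence. A sketch (table and column names invented to mirror the wording, SQLite standing in for any engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE this_table (id INTEGER, these_columns TEXT);
    CREATE TABLE that_table (id INTEGER, those_columns TEXT);
    INSERT INTO this_table VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO that_table VALUES (1, 'x'), (3, 'y');
""")

# "Get these columns from this table and those columns from that table
#  where the value in this column equals the value in that column":
rows = conn.execute("""
    SELECT this_table.these_columns, that_table.those_columns
    FROM this_table
    JOIN that_table ON this_table.id = that_table.id
""").fetchall()

print(rows)  # [('alpha', 'x')]
```

Only the id=1 rows match on both sides, so only that pair comes back; the FROM/JOIN/ON clauses map almost word for word onto "from this table / and that table / where this column equals that column."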

In all honesty, I've never been a fan of SQL. Don't get me wrong -- it's as intuitive as I think it can be..but I would like to see something come out that let's me use even plainer English than "SELECT* FROM blah WHERE blah.foo = 'bar' " etc etc etc.

Don't get me wrong -- single table SELECT statements are simple and friendly. But any time I have to look at a query with even the simplest of joins, my eyes glaze over and I instantly feel a headache coming on. I wish I could just write something as human as "Get these columns from this table and those columns from that table where the value in this column equals the value in that column". It might be a bit more verbose, but geez it makes it much easier on us programmer types that are too stupid to understand complex queries with lots of joins.

I think I see where you're coming from, but I may have gotten you wrong.

I think what you want is advanced natural language processing rather than unnecessary but required syntactic sugar. Try using AppleScript, whose syntax 'prettifiers' may not be required but still leave it in a sort of uncanny valley of language.

In all honesty, I've never been a fan of SQL. Don't get me wrong -- it's as intuitive as I think it can be..but I would like to see something come out that let's me use even plainer English than "SELECT* FROM blah WHERE blah.foo = 'bar' " etc etc etc.

Don't get me wrong -- single table SELECT statements are simple and friendly. But any time I have to look at a query with even the simplest of joins, my eyes glaze over and I instantly feel a headache coming on. I wish I could just write something as human as "Get these columns from this table and those columns from that table where the value in this column equals the value in that column". It might be a bit more verbose, but geez it makes it much easier on us programmer types that are too stupid to understand complex queries with lots of joins.

It must be subjective, because your example is pretty well exactly how my brain interprets SQL queries. I've also come to the conclusion that the people involved in the massive circlejerk over how fantastically intuitive Python is are out of their fucking minds, as PHP makes way more sense to me. Popular opinion on the interwebz tells me I'm the one off his rocker on that count, though. Maybe it's because I have zero formal training and no background in lower-level languages.

In all honesty, I've never been a fan of SQL. Don't get me wrong -- it's as intuitive as I think it can be..but I would like to see something come out that let's me use even plainer English than "SELECT* FROM blah WHERE blah.foo = 'bar' " etc etc etc.

Don't get me wrong -- single table SELECT statements are simple and friendly. But any time I have to look at a query with even the simplest of joins, my eyes glaze over and I instantly feel a headache coming on. I wish I could just write something as human as "Get these columns from this table and those columns from that table where the value in this column equals the value in that column". It might be a bit more verbose, but geez it makes it much easier on us programmer types that are too stupid to understand complex queries with lots of joins.

Heh? It can't be more simple than that. You select specific data from a specific column.

How else would you say that? The query says exactly what someone is looking for in that query.

SELECT (THIS) FROM (HERE)

Not sure how much more simple it can get. If you want to make complex selections and combinations, you don't need the database to do it; you can do that in your code or software or whatever is processing the data. Even though databases support some complex queries, a database is, like its name says, there to hold data.

The database cannot possibly know what specific result you expect; you get data, then filter and process it once extracted to achieve whatever end result you want.

SQL is as simple as it gets and can scale in huge ways to fit 99.9% of all companies and users.

In all honesty, I've never been a fan of SQL. Don't get me wrong -- it's as intuitive as I think it can be..but I would like to see something come out that let's me use even plainer English than "SELECT* FROM blah WHERE blah.foo = 'bar' " etc etc etc.

Don't get me wrong -- single table SELECT statements are simple and friendly. But any time I have to look at a query with even the simplest of joins, my eyes glaze over and I instantly feel a headache coming on. I wish I could just write something as human as "Get these columns from this table and those columns from that table where the value in this column equals the value in that column". It might be a bit more verbose, but geez it makes it much easier on us programmer types that are too stupid to understand complex queries with lots of joins.

It must be subjective, because your example is pretty well exactly how my brain interprets SQL queries. I've also come to the conclusion that the people involved in the massive circlejerk over how fantastically intuitive Python is are out of their fucking minds, as PHP makes way more sense to me. Popular opinion on the interwebz tells me I'm the one off his rocker on that count, though. Maybe it's because I have zero formal training and no background in lower-level languages.

I think we've identified a good programming project. Lots of the latest and greatest technologies we have today were spurred by people that weren't happy with the tools they had. Hell, that's exactly what drove John Resig to create jQuery. He was using Prototype to get the job done, but wasn't happy with its syntax. A few months later, he releases jQuery...and now his code powers ump-teen-bazillion websites on the internet.

Sean Gallagher / Sean is Ars Technica's IT Editor. A former Navy officer, systems administrator, and network systems integrator with 20 years of IT journalism experience, he lives and works in Baltimore, Maryland.