How SQL++ Makes JSON More Queryable

Alex Woodie

(Wright-Studio/Shutterstock)

For more than 40 years, SQL has provided a standard way to query structured data. However, much of the data being generated and stored today exists in semi-structured formats, like JSON, which doesn’t “speak SQL.” But now, thanks to an extension of the SQL language called SQL++, developers who are fluent in SQL have an easier path to incorporating semi-structured data into their queries and analytics.

SQL++ was created by computer science and engineering professor Yannis Papakonstantinou and his graduate students at the University of California San Diego to extend the basic SQL concepts to semi-structured data. When Papakonstantinou looked at the emerging database world, he observed that the proliferation of non-relational stores like NoSQL, SQL-on-Hadoop, and NewSQL databases was giving organizations more flexibility in how they stored big data, but came at the cost of needing to learn specialized query languages that were largely incompatible.

“It is hard to understand, compare, and use these languages,” Papakonstantinou and his grad students, Romain Vernoux and Kian Win Ong, wrote in their 2014 technical paper introducing SQL++. “It is especially tedious to write software that interoperates between two of them or an SQL database and one of them.”

The solution was to create a single, formalized syntax and semantic standard that would allow developers to access and query a range of relational and non-relational data stores. This syntax and semantics would be named SQL++, and it would be an extension of SQL and fully backwards compatible with widely accepted SQL standard.

The UCSD researchers write that the SQL++ semantics “stands on the shoulders of the extensive past work from the database R&D community in non-relational data models and query languages,” including OQL, the nested relational model and query languages, and XQuery. “SQL++ is ‘unifying’ in the sense that it is explicitly designed to encompass the data model and query language capabilities of current SQL-on-Hadoop, NoSQL and NewSQL databases,” they write.

One might expect the SQL and relational database community to bemoan the creation of yet another language to query semi-structured data. If people would just normalize their data and store it in rows and columns like God and Ted Codd intended, darn it, then everything would be just fine!

But exactly the opposite is happening. In fact, none other than SQL co-creator Don Chamberlin recently gave a big endorsement of SQL++ at the recent Couchbase Connect SV event in San Jose, California.

Chamberlin – who created SQL with Ray Boyce to, in effect, “put a friendly face” on the relational database concept created by Codd, whom he and Boyce worked with at IBM’s Almaden Lab in the early 1970s – talked about the experience that led him to support SQL++ during a panel conversation with Couchbase CTO Ravi Mayuram and AsterixDB creator and computer science professor Michael Carey at the Connect SV show.

“The basic idea of SQL++, as I understand, is to extend SQL and turn it into a general purpose query language for JSON,” Chamberlin said. “Well, nobody at Couchbase is going to be very surprised about that, because that’s what we’ve been doing for years now and that’s exactly what N1QL is, a general purpose query language for JSON.”

In fact, it wasn’t necessarily obvious that SQL would be the right language for querying JSON in the first place. There were other options on the table, including XQuery, the XML-based query language that Chamberlin worked on with the W3C.

Gerald Sangudi (left), who led Couchbase’s development of N1QL and Yannis Papakonstantinou (center), creator of SQL++ (Photo courtesy UCSD)

However, there were clues that led folks to believe that SQL (or a SQL-based language) was ultimately the right answer. For starters, Chamberlin notes, the industry has made a huge investment in SQL. There are many people with the skills to write SQL, and many products that generate SQL, including the majority of the business intelligence and analytic tool vendors.

“It would be very nice if we could preserve all of that investment, while meanwhile gaining JSON capability,” Chamberlin said. “I think the relationship between the JSON format the relational format is a closer one than the relationship between JSON and XML. XML is a much more complex and less highly structured language than JSON is.”

Carey seconded that notion. “If you throw enough stuff away from XML you get something that looks like JSON,” he said. “And if you throw enough stuff away from Xquery, you can get a pretty decent language.”

Couchbase has funded the work done by Papakonstantinou and his team at UCSD since 2015. The fruit of that SQL++ labor was a Couches language called it N1QL, which stands for “non-first normal form query language” and pronounced “nickel.” N1QL and SQL++ are essentially the same, and the technology features prominently in the recently announced Couchbase Analytics.

Couchbase Analytics combines the work in N1QL and SQL++ along with the work on Apache AsterixDB that’s been spearheaded by Michael Carey at the University of California Irvine and University of California Riverside. Whereas N1QL/SQL++ provides the query language, AsterixDB provided the parallel processing and execution of those queries as part of a NoSQL take on the traditional MPP database.

“We’re seeing Couchbase Analytics take advantage of all these threads, pulling in the work that’s being done in San Diego and Irvine and Riverside and kind of putting it all together into a unified system,” said Chamberlin, who is employed as an advisor to Couchbase. “I thought that was a pretty exciting story, so I thought I had to write a book. This book, “SQL++ for SQL Users: A Tutorial,” goes over the extensions that were necessary to turn SQL into a JSON query language.”

Chamberlin — who last year said that “JSON looked a little bit like tables,” although you “sort of have to squint” – boiled down the SQL++ extensions that allow SQL to work with JSON data into three items: the “left correlation,” “select value,” and the “group as.” “Each one of them is a small language feature with a little bit of syntax that goes with it,” Chamberlin said. “I was surprised to learn that these extensions were not very large or very complex.”

That prompted Couchbase’s Mayuram to interject. “Don’t get fooled by what Don says. ‘A simple extension of three things.’ That’s how it appears to him,” he said. “But the beauty of this book is how it simplifies things. The examples are to the point and they actually have great clarity.”

Whether or not you buy the book 143-page book, which is also available as an e-book from the Couchbase website, Chamberlin is bullish on the prospects for SQL++. And just as C++ is a superset of everything in C, the new language keeps the core of SQL intact while relaxing its constraints and extending the language to handle JSON data.

“I like the name SQL ++,” Chamberlin said. “It suggests…it still does what it used to do, for the data it was designed for. And at the same time it can handle additional kinds of data with additional expressive power, and so I think this is all part of the same story.”