Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:

In most cases, no.

Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing). Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?

MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.

There’s one big countervailing factor to all these generalities — schema flexibility.
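For readers who have not met MapReduce, mentioned in the short answers above, here is a minimal single-process sketch in Python of the map, shuffle and reduce phases. The `map_reduce` helper, the `SALES` data and the mapper/reducer names are all hypothetical and chosen for illustration; real frameworks such as Hadoop distribute these phases across many machines, which this toy version does not attempt.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a toy, single-process map -> shuffle -> reduce pipeline."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect the values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical analytic question: total sales per region.
SALES = [
    {"region": "east", "amount": 120},
    {"region": "west", "amount": 75},
    {"region": "east", "amount": 30},
]


def sales_mapper(row):
    yield row["region"], row["amount"]


def sales_reducer(region, amounts):
    return sum(amounts)


totals = map_reduce(SALES, sales_mapper, sales_reducer)
```

The same question could be written in SQL as a `GROUP BY` with `SUM`, which is one way to see why folding MapReduce into an analytic DBMS is at least plausible.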

As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. First, you might be fine with the idea of a (somewhat) nonprocedural, schema-aware DML/DDL (Data Manipulation/Description Language), but just think another kind is better, or more suited to your use case. If your reason is like that, you might favor alternatives such as:

OLAP-based languages such as MDX.

XML-oriented languages.

“True” relational languages, because SQL deviated from the path of relational virtue under the corrupt influence of IBM — aka “Blue Babylon” — and the IT world has been languishing in sin ever since.

The second class of reason for avoiding SQL is because you don’t like the idea of a separate schema-aware DML at all. Possible reasons for this orientation include:

You just like to program, and want to manipulate stored data the same way you do anything else. Thus, you are bothered by an “impedance mismatch” between SQL and your favorite programming languages. This is real. It also has been overcome by many, many enterprises around the world.

You believe that more procedural alternatives are a better fit for cloud computing and extreme scale-out on failure-prone commodity hardware. Facebook made that case to me. However, I have trouble thinking of very many enterprise scenarios where it applies, especially when one considers electricity costs and the like.

Your schemas change more quickly than your data architects can reasonably be expected to keep up with. Facebook made that case to me too. Enterprise examples might include marketing campaigns and M&A. I’ve long thought this to be a legitimate, looming concern. But I don’t know that stripped-down DBMS are the way to address it.

You believe that SQL has severe processing overhead. In most enterprise use cases, that would just be bogus.

You lack familiarity with SQL.

That last point is not a joke. One of the weirder database architectures I know of is the one underlying Guild Wars. Its developer — a brilliantly impressive guy — told me flat-out that he learned in college how to build a DBMS, but he didn’t learn how to develop for a conventional one. This was instrumental in his decision to build an unconventional data management architecture that uses SQL Server as little more than a smart file manager.

The questions of SQL performance and — often-unspecified — “overhead” are interesting to view through the lens of the H-Store/VoltDB project. Mike Stonebraker et al.:

Are building a scale-out-oriented OLTP DBMS that is meant to run in RAM, preserving data through replication to other servers’ RAM more than through output to disk.

Believe that 95% of what a typical SQL DBMS does to manage OLTP is wasteful overhead.

Originally planned to not use SQL, but wound up going with SQL because alternatives were insufficiently performant.

Mike himself, of course, has been all over the spectrum on SQL-like languages. First he favored QUEL vigorously over SQL for mainstream relational DBMS. Then he led the charge to extend SQL in PostgreSQL, Illustra, et al. Then he actually staked out a contrarian position in the area of complex event/stream processing by favoring a SQL-like language in an area where other alternatives were better established — but that was at what turned into StreamBase, which now emphasizes visual programming over any kind of coding language.

I need to write much more about schema flexibility, but tonight — which will be my third straight of <<8 hours sleep — is not the time for that.

Comments

30 Responses to “NoSQL?”

Guy Bayes on
July 1st, 2009 1:02 pm

I think some of your points are valid, but others are a little straw-man for my taste. I think the main logical pitfall is the framing that “alternative” means “total replacement for”. As is usually true, both extreme viewpoints are fallacies; nothing is a panacea, and the truth lies somewhere in the middle.

For example, Map/Reduce is not a “total replacement for” SQL and RDBMS; however, it is a viable alternative for some problems currently being solved by relatively inefficient means by SQL and its various bastard children (PL/SQL, COBOL, flavors of ETL and the rest).

This line was especially silly.

“You just like to program, and want to manipulate stored data the same way you do anything else. Thus, you are bothered by an “impedance mismatch” between SQL and your favorite programming languages. This is real. It also has been overcome by many, many enterprises around the world.”

Has it really now? Been completely overcome in a completely efficient way? Then why do we still have more lines of COBOL in the world than all other languages combined?

SQL is a very useful language. It is a powerful language. It is very well suited for many tasks. It is completely unsuited for many tasks. Some of the tasks it is unsuited for are data related and analysis related. The industry has tried to solve these issues in a number of less than totally successful ways, mostly through a stable of pretty popular data analysis and processing tools that are not SQL based.

The new technologies that are coming into play do nothing more than add to that stable. They are doubtlessly being oversold as SQL replacements. However, they certainly are useful and will doubtlessly perform some data and analysis jobs better than SQL. Enterprises will find benefits in migrating some things.

MapReduce is all about throwing massive amounts of resources at a problem in order to get it done in a short amount of time.

MapReduce is not “green” and it represents the opposite direction (IMHO) from where the industry needs to move.

If you want to be green, you have to do more with less. If you want to attract existing users, then you need to solve existing problems, which are mostly with SQL. I really don’t think most shops are going to gain ground by abandoning the query language spoken by just about every database developer and integration product in the world.

XSPRADA may be an exception there, given the unique way it uses algebraic relationships to manage extended sets. I internally try to visualize what they do using “bag algebra” but I don’t fully understand it. I’ve played a bit with it, and it is definitely an interesting technology to keep an eye on.

Seriously, Mike Stonebraker delivered a blistering critique of SQL at SIGMOD today, for reasons not unlike your comments. Unfortunately, I didn’t ask him to expand on it afterwards …

Bo Reid on
July 1st, 2009 10:37 pm

I’ve actually used a commercial Non-SQL, Object Oriented analytical database that uses a LISP variant to access data. Although the learning curve was steeper, especially for those without any OO experience, it was vastly faster to program and at least 10 – 100 times faster in running time series related analyses.

The database name is Vision and it was acquired by Factset Data Systems, and was marketed as “FAST”. It was in use at some of the top 50 asset management systems in the world.

Every time I look at PL/SQL or Transact-SQL code that is longer than about 200 lines, I start to sigh and start whining to the nearest person about how great and easy things were with this non-SQL system.

I read that Pick programmers feel the same way.
I’d be curious to see whether anyone believes that a non-SQL oriented, real 4GL analytical language will ever be adopted outside of narrow industry niches and replace SQL for transactional and analytical purposes. I don’t see this happening. I think it is about as likely as English drivers voting to drive on the right side of the road.

NoSQL is about scaling, particularly scaling writes, not about antipathy towards an ancient query language. You can’t have both just-add-machines-and-push-a-button scaling of writes and traditional ACID SQL support; Google’s GQL on App Engine perhaps comes closest but even that is only a superficial resemblance.

So I’d have to disagree with the original post and say that these solutions — some more than others — are useful for anyone with massive write volumes.
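The write-scaling argument in this comment can be illustrated with a minimal sketch of hash partitioning, written in Python. The `ShardedStore` class and its in-memory "nodes" are hypothetical stand-ins for separate machines; this is not how BigTable, Cassandra or App Engine are actually implemented, only the general idea of routing each write to one node by key.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads writes across in-memory 'nodes'."""

    def __init__(self, node_count):
        # Each dict stands in for one machine (an assumption for illustration).
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Stable hash, so the same key always lands on the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, value):
        # A single-key write touches exactly one node.
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)


store = ShardedStore(node_count=4)
for i in range(1000):
    store.put(f"user:{i}", {"visits": i})

writes_per_node = [len(node) for node in store.nodes]
```

Because each write lands on a single node, adding nodes adds write capacity. The difficulty the comment points at appears as soon as one transaction must update keys that live on different nodes, since that requires coordination this sketch leaves out. Note too that the naive modulo scheme here would move most keys when the node count changes.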

“XSPRADA may be an exception there” – I think so, because another thing we don’t do is “load” data to begin with. We only ingest it when queried, and only then do we do data conversions as needed. So we don’t convert data types from the get-go. This is why it doesn’t matter if you store everything as text in a CSV file (notwithstanding the obvious size penalties involved). At the end of the day, we only convert in JIT mode.
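The general idea of deferring ingestion and type conversion until query time can be sketched in Python as follows. This is only an illustration of the concept described in the comment, not XSPRADA's actual implementation; the `LazyCsvTable` class, its methods and the sample data are all hypothetical.

```python
import csv
import io


class LazyCsvTable:
    """Keeps CSV data as raw text; parses and converts only on demand."""

    def __init__(self, raw_text, converters):
        self._raw_text = raw_text      # nothing is parsed or "loaded" yet
        self._converters = converters  # column name -> conversion function
        self._rows = None
        self._typed = {}

    def _ensure_rows(self):
        # Parse the text the first time any query needs it.
        if self._rows is None:
            self._rows = list(csv.DictReader(io.StringIO(self._raw_text)))
        return self._rows

    def column(self, name):
        # Convert a column only when a query first touches it, then cache it.
        if name not in self._typed:
            convert = self._converters.get(name, str)
            self._typed[name] = [convert(row[name]) for row in self._ensure_rows()]
        return self._typed[name]

    def converted_columns(self):
        return sorted(self._typed)


RAW = "ticker,price,volume\nAAA,10.5,100\nBBB,20.25,300\n"
table = LazyCsvTable(RAW, {"price": float, "volume": int})

prices = table.column("price")
average_price = sum(prices) / len(prices)
```

After this query only the `price` column has been converted; `volume` stays as text until something asks for it.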

I agree with Guy Bayes, the impedance mismatch is very much there, and while there are workarounds and tools for binding these layers, these all come with a price to pay. I rather like how Ted Neward refers to this issue, namely the Vietnam of Computer Science.
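For readers unfamiliar with the term, the impedance mismatch discussed here can be seen in a small Python sketch using the standard-library `sqlite3` module. The `Order` class, the two tables and the `save`/`load` helpers are hypothetical; the point is only that one nested object in the program has to be taken apart into flat rows and reassembled by hand.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: int
    customer: str
    # Nested in the program, but stored as separate flat rows in SQL.
    items: list = field(default_factory=list)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER);
""")


def save(order):
    # One object becomes rows in two tables.
    conn.execute("INSERT INTO orders VALUES (?, ?)",
                 (order.order_id, order.customer))
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [(order.order_id, sku, qty) for sku, qty in order.items],
    )


def load(order_id):
    # Rows from a join are folded back into one nested object.
    rows = conn.execute(
        "SELECT o.customer, i.sku, i.qty FROM orders o "
        "LEFT JOIN order_items i ON i.order_id = o.order_id "
        "WHERE o.order_id = ? ORDER BY i.rowid",
        (order_id,),
    ).fetchall()
    if not rows:
        return None
    items = [(sku, qty) for _, sku, qty in rows if sku is not None]
    return Order(order_id, rows[0][0], items)


save(Order(1, "acme", [("widget", 2), ("gadget", 1)]))
restored = load(1)
```

Object-relational mapping tools automate this translation, which is the kind of workaround, with its own costs, that the comment refers to.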

Part of the difficulty with the NoSQL crowd, I think, is that they mischaracterize the very nature of enterprise computing, which is very structural and procedurally oriented. Scaling writes doesn’t help manage workflow or handle complex business rules, and those are the kinds of issues that harangue folks in enterprise space.

It’s true that relational database schemas are relatively inflexible and transaction processes get locked in because of this, but that problem has been largely solved by column based databases. It’s ironic that Stonebraker is referenced here because he’s involved in Vertica, a column based database built for the cloud that uses an SQL dialect and Map Reduce.

From my perspective, as someone who has worked extensively in data warehousing and analytics with the best of relational and multidimensional technologies, there is much to admire and fear in the scalability leapfrogging that the webguys have accomplished. But I think very few of them have dealt with workflows any more complex than the path of an ecommerce shopping cart. The skills to replace relational technologies require much more of the mindset of game programmers than of web programmers.

I say this also because I too see the impedance mismatch as the key problem, considering all the new jive in the middle tier over the past decade. A real “4GL” integrated programming environment is what some of these NoSQL guys need to live in building something like a medical patient records application. I think that will throw a bit of water on their fire.

I think game programmers have, at a high level of abstraction, an implicit understanding of the roles in which players are involved and how they can mature in the context of an expansive and dynamic environment. Whereas corporate IT and the enterprise software developers who build products for them are very narrowly focused and have no contextual idea of how people work in multiple environments.

Consequently, the user experience of integrating data from one area of the business is tedious and gets tiresome very quickly, whereas one can play videogames in completely alien environments for hundreds of hours and enjoy doing so.

Take a well understood business concept like supply chain or inventory management. For something as simple as visualization, have you ever seen anything that resembles creativity used to present the current inventory levels in a company’s warehouse? I haven’t. It has been a bar chart for 20 years.

I have yet to see, in a Fortune 500 company, any BI that functions in realtime with as much creativity and immersion as Maxis did with A-Train in 1992. If I presented to a corporate financial planning and analysis group some UI with an interface as complex as.. oh say the main screen of an XBOX 360 I’d either be hailed as a genius or shot on sight.

The status quo for organizing the finances of a million businesses is a warren of Excel spreadsheets in a complex nest of shared file servers. Same as it was in 1990. You can’t even visually view the links between the spreadsheets. Much less build abstract ways of dealing with the workflow.

The next generation of mouse, thumb and touch literate users are not going to stand for it. The only question is when is the revolution going to begin.

@Cubegeek: “It’s true that relational database schemas are relatively inflexible and transaction processes get locked in because of this, but that problem has been largely solved by column based databases.”

I don’t see how columnar engines have addressed this problem; they remain relational databases. The fact that they manage storage in columnar fashion doesn’t mean they’re either non-relational or more flexible with schemas. I don’t think anyone in this columnar space is trying to replace relational.

“…he’s involved in Vertica, a column based database built for the cloud that uses an SQL dialect and Map Reduce.”

I don’t believe Vertica was built for the cloud at all. Initially there was no talk of any SaaS or on-demand offering. They jumped on the bandwagon conveniently when possible. Furthermore, what does it mean to be “built” for the cloud when essentially at issue there is the delivery mechanism more so than the actual backend engine, IMHO?

I’m taking people at their word that Vertica is as flexible as Oracle (Hyperion) Essbase when they draw equivalences. Essbase is very flexible when it comes to adding ‘columns’ on demand. Microsoft Analysis Services is as well to a lesser degree.

I think hub & spoke methodologies will be adapted to the new super scalable storage technologies. I’m trying to think of ways to build systems using optimized combinations of databases – but I’m a database guy so that figures.

When I think of ‘built for the cloud’ I’m thinking about using smart partitioning logic that scales across racks of relatively small servers and some parallelization. Devil’s in the details. But I cannot get over the prejudice, and please tell me if I’m wrong, that NoSQL is adopted by people who think first of MySQL as representative of relational tech – i.e., guys who build websites.

Vertica was designed from the ground up to be MPP. They’re in the “Whatever else was missing in the Release 1, we got MPP right the first time” camp, along with Aster, rather than the “MPP is hard but we got it right after a couple of tries, and by the way we’ve been doing this longer than the other guys” camp, where Greenplum sits.

Among the new vendors, the former group tend to have more compelling automatic-redistribution-among-nodes stories than the latter group, but it’s not clear the edge is a big one.

I do a lot with full-text search and that is a case where it just doesn’t match the architecture of SQL-oriented DBMSs. Most of the content is unstructured and almost all that is tagged or fielded is text rather than numeric. Inverted indexes are hugely more efficient for fast lookup than anything else, and the transaction code in DBMS is enough of an overhead to kill performance. Sure, there have been SQL-like query languages, but for the raw work of keyword search, SQL just doesn’t work. (And if you want to get me really ranting, let me tell you about the so-called MySQL full-text search, it stinks.)
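To make the inverted-index point concrete, here is a minimal Python sketch. The function names and the three sample documents are hypothetical, and real search engines add ranking, stemming, compressed posting lists and much more; the sketch only shows why a keyword lookup becomes a dictionary access plus a set intersection rather than a scan of every document.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


DOCS = {
    1: "NoSQL systems scale writes",
    2: "SQL systems support joins",
    3: "Full-text search needs inverted indexes",
}
index = build_inverted_index(DOCS)
hits = search(index, "text search")
```

Each query term costs one lookup in the index regardless of how many documents exist, which is the efficiency the comment is describing.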

Guy Bayes: ”Kurt, we agree, poor choice of names on their part 🙂 http://nosql.eventbrite.com/”
Yes, it is called NoSQL, but it is about non-relational databases, not about avoiding SQL. From http://nosql.eventbrite.com/ : ‘This meetup is about “open source, distributed, non relational databases”. Have you run into limitations with traditional relational databases?’ And from http://blog.oskarsson.nu/2009/06/nosql-debrief.html : ‘The idea was to give attendees a solid introduction to how distributed, non relational databases work as well as an overview of the various projects out there.’
But what about whether the SQL language is really needed to work with relational databases?
That question came to mind because of these two points:
Curt Monash: “You just like to program, and want to manipulate stored data the same way you do anything else. …” and “You lack familiarity with SQL.”
I asked about that in the #php IRC channel and published the conversation: http://qdb.wp.kukmara.ru/2009/07/15/is-sql-language-really-needed-to-c-like-language-to-work-with-db/ – it includes some C-like syntax as an alternative to SQL.

[…] Avoiding joins is a big deal because a lot of programmers didn’t learn SQL in school. Also, joins can be computationally expensive. I wrote about some of the problems with fixed schemas here and, specifically in connection with NoSQL, here. […]

There are a few other points that may explain the rationale behind the NOSQL movement.

* Actual disk failure/year is 3% (vs. estimates of 0.5 – 0.9%) – so the actual rate is roughly 3 to 6 times the reported one (up to 600% of it).
* There is NO correlation between failure rate and disk type – whether it is SCSI, SATA, or fiber channel.
* There is NO correlation between high disk temperature and failure rates

These analyses show that the approach of relying on shared storage for reliability, as with most RAC clusters, is broken. Instead, the NOSQL approach assumes that failures are inevitable, and these systems were designed to deal with such failures under extreme scenarios.
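As a quick check on the failure-rate figures in the first bullet above, the arithmetic can be worked through in Python. The 100-disk array size and the assumption that disks fail independently are mine, added purely for illustration; the 3% and 0.5 – 0.9% rates are the ones quoted in the comment.

```python
def failure_ratio(observed_rate, estimated_rate):
    """How many times larger the observed annual failure rate is."""
    return observed_rate / estimated_rate


def prob_any_failure(annual_rate, disk_count):
    """Chance that at least one of disk_count disks fails within a year,
    assuming (hypothetically) that failures are independent."""
    return 1.0 - (1.0 - annual_rate) ** disk_count


OBSERVED = 0.03            # 3% per year, as quoted above
ESTIMATES = (0.005, 0.009)  # 0.5% and 0.9% per year, as quoted above

# Ratio of observed to estimated rate at each end of the estimate range.
ratios = [failure_ratio(OBSERVED, estimate) for estimate in ESTIMATES]

# Chance of losing at least one disk per year in a hypothetical 100-disk array.
p_observed = prob_any_failure(OBSERVED, disk_count=100)
p_estimated = prob_any_failure(ESTIMATES[0], disk_count=100)
```

Under these assumptions the observed rate is about 6 times the low estimate and a little over 3 times the high one, and a 100-disk array would lose at least one disk in a year with a probability of roughly 95% at the 3% rate, against roughly 39% at the 0.5% rate. That gap is the sense in which failure has to be treated as routine.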