Posted by kdawson on Thursday September 06, 2007 @12:27PM
from the long-in-the-tooth dept.

jpkunst sends us to Computerworld for a look at Michael Stonebraker's opinion that RDBMSs "should be considered legacy technology." Computerworld adds some background and analysis to Stonebraker's comments, which appear in a new blog, The Database Column. Stonebraker co-created the Ingres and Postgres technology as a researcher at UC Berkeley, starting in the early 1970s. He predicts that "column stores will take over the [data] warehouse market over time, completely displacing row stores."

Since when are a column store database and a relational database mutually exclusive concepts? I thought that both column store and row store (i.e. traditional) databases were just different means of storing data, and had nothing to do with whether a database was relational or not. I think the article misinterpreted what he said.

Also, I don't think it's news that Michael Stonebraker (a great name, by the way), co-founder and CEO of a company that (surprise!) happens to develop column store database software, thinks that column store databases are going to be the Next Big Thing. Right or wrong, his opinion can't exactly be considered unbiased...

Since when are a column store database and a relational database mutually exclusive concepts? I thought that both column store and row store (i.e. traditional) databases were just different means of storing data, and had nothing to do with whether a database was relational or not. I think the article misinterpreted what he said.

Agreed. It definitely looks like a storage preference. Though column-based storage has definite benefits over row-based when it comes to store-once, read-many operations. Kinda like what you'd find in a data warehouse situation...

Also, I don't think it's news that Michael Stonebraker (a great name, by the way), co-founder and CEO of a company that (surprise!) happens to develop column store database software, thinks that column store databases are going to be the Next Big Thing. Right or wrong, his opinion can't exactly be considered unbiased...

Column stores are great (better than a row store) if you're just reading tons of data, but they're much more costly than a row store if you're writing tons of data.

Therefore, pick your method depending on your needs. Are you writing massive amounts of data? Column stores are probably not for you... Your application will run better on a row store, because writing to a row store is a simple matter of adding one more record to the file, whereas writing to a column store is often a matter of writing a record to many files... Obviously more costly.

On the other hand, are you dealing with a relatively static dataset, where you have far more reads than writes? Then a row store isn't the best bet, and you should try a column store. A query on a row store has to query entire rows, which means you'll often end up hitting fields you don't give a damn about while looking for the specific fields you want to return. With column stores, you can ignore any columns that aren't referenced in your query... Additionally, your data is homogeneous in a column store, so you lose overhead attached to having to deal with different datatypes and can choose the best data compression by field rather than by data block.
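The read-path difference described above can be sketched in a few lines. This is a toy illustration (not any particular engine): the same table laid out row-wise and column-wise, and a query that only needs one field.

```python
# Row-oriented layout: one record per row, all fields together.
rows = [
    {"id": 1, "name": "Alice", "salary": 50000, "dept": "eng"},
    {"id": 2, "name": "Bob",   "salary": 60000, "dept": "ops"},
    {"id": 3, "name": "Carol", "salary": 70000, "dept": "eng"},
]

# Column-oriented layout of the same data: one homogeneous list per field.
columns = {
    "id":     [1, 2, 3],
    "name":   ["Alice", "Bob", "Carol"],
    "salary": [50000, 60000, 70000],
    "dept":   ["eng", "ops", "eng"],
}

# Row store: every row is touched, including fields the query ignores.
total_row = sum(r["salary"] for r in rows)

# Column store: only the one homogeneous column is scanned.
total_col = sum(columns["salary"])

assert total_row == total_col == 180000
```

Same answer either way; the difference is how much data a disk-backed engine would have had to read to produce it.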

So it seems to me the -real- money is in integrating an RDBMS which, for usage purposes, is row-oriented; but which, for archival purposes, is column-oriented. This could either be a backup-type thing, or an aging-type thing. Quick, to the Pat(ent)mobile!

Maybe, but I doubt it. The money is in the data warehouse market and the ETL tools that move the data from the OLTP environment to the warehouse environment. I think what the author points out is not that people are trying to use the same database to do both, but rather that they are trying to use the same product to do both. He says it would make more sense to use Oracle (for example) for OLTP - and something else for the warehouse, rather than trying to get Oracle to do both well.

The sane thing to me seems to be to just skip the OLTP step, since the data contained therein is a strict subset of the data in the warehouse. Let's say you design a relational database for a company. You have a table with Employees and another with their Positions, a Project table and so on, like the orthodox relational model tells you to. Which works great until someone quits; then you have a problem.

Naturally, you don't want to delete that person because then you lose lots of important archival data. So y

Not sure how that follows... The column style database would be functionally identical to the row style database, only you'd have column proliferation instead of row expansion. The easiest way to deal with proliferating events is to create a very simple table that has a timestamp, your basic audit information (user who made the change, terminal the change was made from, etc.), and the change itself.

So say Bob makes 50,000 dollars. This entry was put in the table when he was hired and contains bob's employee re
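The simple audit table described above can be sketched with sqlite3 from the standard library. Table and column names here are made up for illustration, not taken from any real schema.

```python
# Minimal audit-log table: timestamp, who, where from, and the change itself.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit_log (
        changed_at  TEXT NOT NULL,     -- timestamp
        changed_by  TEXT NOT NULL,     -- user who made the change
        terminal    TEXT NOT NULL,     -- terminal the change was made from
        change      TEXT NOT NULL      -- the change itself
    )
""")

# Record one salary change as an append-only event.
conn.execute(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
    ("2007-09-06T12:27:00", "hr_admin", "term42", "salary: 50000 -> 55000"),
)

row = conn.execute("SELECT changed_by, change FROM audit_log").fetchone()
print(row)  # ('hr_admin', 'salary: 50000 -> 55000')
```

Since rows are only ever appended, the full history survives even after an employee record would otherwise be deleted.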

The answer to what you're describing is not to give up relational DBs, but to design your schema correctly. There are tried and true approaches to the problems you describe; several, actually, for most of them, depending on your needs.

If you run into these problems due to an evolutionary growth into these features, then it's time to stop, take a step back, and re-architect your schemas to handle these needs from the get-go.

There's no reason at all to resort to hacks like stored procedures and triggers. These a

Cell-based storage!!! Best of both worlds!!! Mix of both Row and Column based storage, how can we go wrong!

You are years late. The PICK operating system/DB already does that. Back in 1985 I used the DOS-based Advanced Revelation to write GAAP accounting packages. It used the ASCII 253 character to separate "columns" of data in a cell. Reading and writing was parsed automatically. Invoice information, for example, was stored in a Customer's info table, not in an invoice table, and doing a query on accounts receivable produced very fast results. Symbolic dictionary definitions containing formulas allowed for easy column and row totals.

In fact KDB/K looks a lot like a PICK system that uses FORTH as the language.

Just to add to that - PICK/D3 is called a "multivalue" database, and there are a lot of interesting kinds of databases out there that use this "multidimensional"/sparse array storage concept, such as the ancient MUMPS legacy system/database (now implemented as InterSystems' Caché product) found in old Veterans Affairs (VA) systems in the US. I actually wonder if some of the current databases such as Microsoft SQL Server, etc. aren't going to actually start morphing into these older styles of databases due t

PICK has some very serious disadvantages compared to relational databases, not the least of which is the rather annoying property that queries often do not return *ALL* of the related records unless the person writing the query has a very intimate knowledge of the data cubes and how the database is connected, and even then there are no guarantees. The brilliance of atomicity defined by Codd and the query guarantees that one is able to make when the "one value per cell" rule is enforced are not of trivial benefi

I went back and read the original article. To Michael Stonebraker's credit, the ComputerWorld article (and the submitter) grossly misrepresents what he said.

He did not say that RDBMSes are "long in the tooth." He said that the technology underlying them hasn't changed since the 1970s, and that column stores are a better way to represent data in certain situations. In fact, the very name of his original column was "One Size Fits All - A Concept Whose Time Has Come and Gone".

The "relational" concept existed before SQL, really. It's just a question of whether you want to do the work on the client side or the server side. The modern RDBMS is good when the pipe from client to server is much smaller than the pipe from server to backing store/cache. Minimal communication for maximum results. The trade-off, of course, is that the server needs lots of resources because it's doing significant work on behalf of every client.

Traditionally, Perl objects are hashes, with one blessed hash per instance. The hash contains all the instance variable values, using their names as keys.

Instead, one can use blessed scalars holding a single integer value for instances, and let class variables contain all the instance data in arrays indexed by the instance's scalar value.

This technique was originally promoted as an indirection to protect object data from direct manipulation that bypassed get/set methods. But it also allows the object to be either row- or column-oriented internally. That is, the class could store all the instance hashes in an array indexed by the scalar, or it could store each instance variable in a separate array that is indexed by the scalar value.

Thus the Perl class can, on the fly, switch itself from column-oriented to row-oriented as needed while maintaining the same external interface.

Of course this is not a Perl-exclusive feature, and it can be implemented in other languages. It just happens to be particularly easy and natural to do in Perl.
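As the poster says, the trick isn't Perl-specific. Here is a rough Python translation of the same inside-out idea: each instance holds only an integer index (the "blessed scalar"), and all instance data lives in class-level, column-oriented arrays. The class name and attributes are invented for the example.

```python
class Employee:
    # One list per attribute: column-oriented instance storage.
    _names = []
    _salaries = []

    def __init__(self, name, salary):
        self._id = len(Employee._names)   # the "blessed scalar" index
        Employee._names.append(name)
        Employee._salaries.append(salary)

    @property
    def name(self):
        return Employee._names[self._id]

    @property
    def salary(self):
        return Employee._salaries[self._id]

a = Employee("Alice", 50000)
b = Employee("Bob", 60000)

# The external interface looks like ordinary objects...
assert a.name == "Alice" and b.salary == 60000
# ...but a whole-column operation never touches the objects themselves.
assert sum(Employee._salaries) == 110000
```

Swapping the per-attribute lists for a single list of per-instance dicts would flip the internal layout back to row-oriented without changing the external interface, which is exactly the point being made.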

I don't think that has anything to do with the article. That is about storage on disk, not about manipulating pointers in memory to such an extent that a programming language that should never have been invented in the first place becomes even more incomprehensible.

You are exactly right, and this is backed up by the home page for C-Store [mit.edu]. It says: "C-Store is a read-optimized relational DBMS" - C-Store is the open source project that apparently is the basis for Vertica, Stonebraker's commercial offering.

Yeah, I looked into Vertica. Was pretty excited at first, as my company does a lot of data warehouse and data mart stuff. The potential performance gains were significant enough to start looking at converting a lot of our reporting, etc. to it. So I gave them a call and started asking some questions related to the key usage one would expect to make of a "warehouse" or data mart type of database.

1st Q: "Can you run MDX queries against the Vertica DB?" A: "No, we don't support MDX queries."

I wish we could put this thing to rest once and for all. And I wish so-called "experts" in the field actually were.

Rule of thumb:
- You use row DBs for OLTP. They're great for writing.
- You use column DBs for data mining. They're amazing for reading aggregates (average, max, complex queries...).

The major problem with column DBs is the writing part. If you have to write one row at a time, you're screwed, because it needs to take each column, read it, insert into it, and store it. If you can write in batch, the whole process isn't much more expensive. So writing a single row could take 500ms, but writing 1000 rows will take 600ms. Once the data's in, column DBs are the way to go.
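The batching argument can be made concrete with a toy model (this is not how any real engine works; file names and numbers are invented): a column store keeps one file per column, so a single-row insert touches every file, while a batched insert amortizes that per-file overhead across many rows.

```python
import os
import tempfile

columns = ["id", "name", "salary", "dept"]

def insert_batch(dirpath, rows):
    """Append a batch of rows column by column: one file touch per column."""
    file_touches = 0
    for i, col in enumerate(columns):
        with open(os.path.join(dirpath, col + ".col"), "a") as f:
            for row in rows:
                f.write(str(row[i]) + "\n")
        file_touches += 1
    return file_touches

with tempfile.TemporaryDirectory() as d:
    # One row: 4 file touches. Batched 1000 rows: still only 4 file touches.
    touches_single = insert_batch(d, [(1, "Alice", 50000, "eng")])
    touches_batch = insert_batch(d, [(i, "x", 0, "y") for i in range(1000)])
    assert touches_single == touches_batch == len(columns)
```

The fixed per-column cost (open, seek, flush) is paid once per batch rather than once per row, which is why the 500ms-per-row vs. 600ms-per-1000-rows shape of the numbers above is plausible.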

So, what we need is a morphing database structure depending on need. While I am writing, I want it to be row-based. As soon as I am done and ready for reading, it should switch to a column-based structure. Who's up for starting a new project on sourceforge.net? The shape-shifting relational database! SSRD is the wave of the future!

Since when are a column store database and a relational database mutually exclusive concepts?

It doesn't. The original blog is about row-oriented DBMS vs. column-oriented DBMS, and the author of the article (or his know-it-all-better editor) confused himself enough to believe somebody abbreviated that as RDBMS, which of course means Relational DBMS. The submitter, who probably didn't read the Wikipedia article he linked to, didn't help either.

Not quite. Stonebraker was THE core developer on Ingres, which was the second relational DB created (System R was the first, developed at IBM using SQL). He operated at both ends of the spectrum. Gray/Mohan did the same as well. While System R disappeared, Ingres was developed into a major company (Ingres) and was ultimately bought by CA.
Later, Stonebraker's work on Postgres (theory AND code) was about how to handle different datatypes within databases. He took an OO approach to that. That was directly used in Illustra, which then went on to Informix. More importantly, Oracle used a lot of that work to create 8i, as have other DBs. IOW, he IS a leading theorist AND knows the code.

Considering that he has been on top of all the major advances within the DB world, why would you discount what the man says? As it is, you mention Gray and Mohan, who both did some good work at IBM, but have not really advanced DBs forward that much. They simply moved the relational model DB forward (basically, they were red herrings). But Stonebraker is working across ALL the spectrums and contributes heavily to new models. His work is everywhere.

Finally, think about what he says. Column major is more useful for data warehousing BECAUSE it allows for data to be compressed quickly and tighter (which makes sense), AND allows you to work with just the data that you need. In a row major, you will end up creating and maintaining indexes to increase the speed of reads. But an index is for the most part a single (or just a few) columns, which basically makes it a column major. But this requires LOADS of CPU and space to maintain. The column major approach simply keeps the indexes, if you will, and discards the rows. This allows for FAST operations if you are doing LOADS of reads and few changes. That is PERFECT for data warehousing.

So armed with that knowledge, exactly WHY would you discount his work and his statements?

And the important part left out with all the sensationalism: the API wouldn't change. It would still be SQL. It would still be an RDBMS. It would essentially just have a different storage engine. Products like MySQL seem to get along just fine supporting multiple options for storage engine. I don't see why an additional option to improve performance in certain cases would obsolete anything.

Is there a dual-mode db, that lets you create a row-based or column-based "table"? I imagine cross-mode queries would kill performance, but at least you could have a system front-loaded with row tables, where data comes in, and then archive this data over time into the column-based tables, so that reads were fast.

I believe you can build a storage engine in MySQL that deals with column-based storage. I'm not sure if it's been done yet, but I don't see why it couldn't be.

TFA threw me for a loop a couple of times; I honestly _did_ try to read it. :) Correct me if I'm incorrect, but wouldn't a dedicated service for column stores be (usually) unneeded on most Unix-like platforms? Since this is mostly reading, I would think such efforts might be better spent on SQLite (or similar)?

It can be done with MS SQL and others. You build a cursor that creates a SQL call, and you execute the SQL call mixing CASE statements and aggregate functions, and you are all set. Is it easy? No, not really. Is it supported? No. But it can be done, and I have done it.

You can get MS SQL Server to store tables differently than the default? It will write columns to disk as opposed to rows? You can store columns in their own files? It's been a few years since I worked with SQL Server - but I really don't remember those features. Is it a SQL Server 2005 thing?

Yes in Windows 2000, if you don't want to look behind the curtains. No, but you can make views and stored procedures that can do the trick; that makes it look like it, aids in programming, and can save time.

Actually, yes, it is a SQL 2005 thing, although there was a way to do it in SQL 2000 called data cubes, from my understanding. You end up having multiple data files, just like you would in an Oracle situation. It's easier to explain in Oracle terms, as you'd just create a tablespace for column-based tables and a tablespace for row-based tables. Then away you go, both storing files as you see fit.

I guess in SQL 2005 terms you'd be creating another database on the same server and just use server linking to get your

Ok, think about it this way: if you are doing killer aggregates (tell me the sum of the sales in every month for the last 25 years), you are going to be limited by possibly two things: CPU cycles and disk I/O throughput.

There are several ways of addressing these issues. Basically this means either optimizing or parallelizing. Column-oriented stores are likely to help optimize the disk I/O throughput, so you can just throw more processor effort at the problem.

That would actually be a smart way to handle it. Lots of apps write only to a single master (or its fallback) and read only from the slaves already. If you had a row-based master for fast writes and replicated to column-based slaves that could be a real win.

RLE on the data columns is a pretty big win for column-based stores, too. If the slaves manage RLE during a replication, you could have one hell of a DB farm.
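Run-length encoding pays off on columns precisely because a sorted column is one long sequence of repeated values. A minimal encoder/decoder (illustrative only, with a made-up "dept" column) shows the effect:

```python
def rle_encode(column):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs

def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# A sorted department column compresses to one run per distinct value:
dept = ["eng"] * 5000 + ["ops"] * 3000 + ["sales"] * 2000
runs = rle_encode(dept)
assert runs == [["eng", 5000], ["ops", 3000], ["sales", 2000]]
assert rle_decode(runs) == dept
```

10,000 cells collapse to three pairs, and some engines can even aggregate directly over the runs without decompressing. Row-oriented pages rarely see runs this long, since adjacent bytes belong to different columns.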

From what little exposure I've had to MySQL - it can't do much of anything for you or anyone else. I think if you move to PostgreSQL you will find that it will do your dishes for you and make you a better person.

>From what little exposure I've had to MySQL - it can't do much of anything for you or anyone else. I think if you move to PostgreSQL you will find that it will do your dishes for you and make you a better person.

Well, it stores my data and meets my performance requirements. Is there something else I need it to do, given that I already own a dishwashing machine?

...is duping [slashdot.org] himself [slashdot.org] and thus Slashdot is duping the stories by extension.

Stonebraker has been pushing the concept of column-oriented databases for quite some time now, trying to get someone, ANYONE, to listen that it's superior. While I think he has a point, I'm not sure if he really goes far enough. Our relational databases of today are heavily based on the ISAM files of yesteryear. Far too many products threw foreign keys on top of a collection of ISAMs and called it a day. Which is why we STILL have key integrity issues to this day.

It would be nice if we could take a step back and re-engineer our databases with more modern technology in mind. e.g. Instead of passing around abstract id numbers, it would be nice if we had reference objects that abstracted programmers away from the temptation of manually managing identifiers. Data storage is another area that can be improved, with Object Databases (really just fancy relational databases with their own access methods) showing how it's possible to store something more complex than integers and varchars.

The demands on our DBMSes are only going to grow. So there's something to be said for going back and reengineering things. If column-oriented databases are the answer, my opinion is that they're only PART of the answer. Take the redesign to its logical conclusion. Let's see databases that truly store any data, and enforce the integrity of their sets.

...duping himself and thus Slashdot is duping the stories by extension.

I read the blurb and thought "Haven't we had the same 'debate' over the same guy a bunch of times before?" The name stuck in my head as I always envision the former Notre Dame linebacker [cstv.com] and his famously low GPA turning to a career in database architecture.

Or in other words, column-wise reads would be fast but writes slow. I could see this being an option within DBs in the future, as it is just a data-layout problem, not a language problem.

An interesting idea for improving database technology is to actually change the way that database data is mirrored in a disk array. Rather than writing EXACT duplicates of the data, perhaps one set could be written in row-oriented form, while the other set would be written in column-oriented form. This guarantees that the data

You've all heard of the IBM product called DB2, right? So what was DB1?
Answer: IMS, which is a hierarchical database. They were a pain in the ass to use--PSBs and all--but they were/are faster than hell and I doubt any company is going to throw them out for any reason.
Same goes for relational databases. They're going nowhere. Sure, we have room for more but nobody is going to displace the RDBMS anytime soon.

From a standard 3rd generation programming language one can read and write into flat files, and we can do close to this with a hierarchical database.

We lose this with relational databases because the way the database organises data has no direct mapping to the way it might be set up in a standard programming language.

What this means is that every transaction to and from the database must go through a literally horrible re-mapping. I.e., the language data structures do not correspond to the RDBMS data structures, and vice versa.

As an example - in PostgreSQL, the last I looked at writing a simple row into a table where there were something like 100 columns in the row...

In the 3rd generation programming languages this was just a simple structure with 100 entries.

The data transfer from that structure generated a function call with more than 1000 parameters. This was to be mapped and re-mapped with each call to transfer data, even though the structure itself is static and determined at compile time.

The thing is, such a table could be set up very easily and populated with a simple loop that rolls in the required values via, say, a mapping function with about 10 arguments. This could be done ONCE at run time to prepare for the transfer of data, and then the same table could be referenced for each call, and simply an address could be sent with the transfer.

Noooo... It was dynamic, and the data was encoded as parameters on the stack. This means the stack must be built and torn down and rebuilt for each call.

Next - the implementation was so bad that the program would run in test mode with only a few parameters, but it failed when the whole row was to be transferred.

I gave up on that interface.

---------------

Oracle had pre-compilers. They did the same damn thing. The code generated by the pre-compilers was just awful.

---------------

While there is much good to say about RDBMSs in general, the issue I ran into was that the interface from 3rd generation languages took a HUGE step backward. IMHO we should have high level language statements called DBRead() and DBWrite(). In C these should generally correspond to fread() and fwrite(). If this is too complex, then DBWriteStruct() could be implemented with suitable mapping helper functions.

Nooo...

In the old days one could read and write into a flat file at a given location with a single statement or function call depending on the language. Of course "where" to read and write became a real issue and I do fully understand the complexity of file based tree structures and so forth, especially since I wrote a lot of code to implement these algorithms.

The thing is now we have RDBMS and other solutions that give us the data organisational abilities we need - and we lose the ease of mapping these structures into a suitable structure or object in the programming language.

I for one do not think we have stepped forward very far at all.

-------------

I'll toss in a case in point made by a good buddy of mine who just happens to be one of the top geophysical programmers in this city.

One of his clients was running an application hooked to an Oracle database running on a fast Sun. Run times were close to a day.

Finally they removed the Oracle interface and replaced it with a glorified flat file. They clearly built in some indexing. The result is that the run times dropped to under 20 minutes.

As my buddy says - He will NOT use any RDBMS. He can take 5 views of the data comprising 1000's of seismic lines and the user can click on any trace number, line number, well tie and so forth and in real time he can modify all views of the data on as many as 5 s

I think a place to start is to ask how to map language structures and RDBMS structures into a common denominator. One should never be looking at function calls with over 1000 parameters. That is just plain stupid. One should also never be dynamically mapping each and every tidbit of every field in a row on the fly at run time and especially so for each row in a table.

Quite right, which is why programmers who still have their sanity use JDBC or DBI. This part of your problem has already been solved at least twice.
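The DBI/JDBC-style fix can be sketched in Python's DB-API (sqlite3 here; table and field names are invented). The statement text is derived ONCE from the record's field names, and each subsequent call just binds values, so there is no per-call remapping code even for a wide table:

```python
import sqlite3

fields = ["id", "name", "salary"]   # stands in for the 100-column struct

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE emp ({', '.join(f + ' TEXT' for f in fields)})")

# Statement string built once from the static structure...
sql = (
    f"INSERT INTO emp ({', '.join(fields)}) "
    f"VALUES ({', '.join('?' for _ in fields)})"
)

# ...then reused for every record; only the bound values change per call.
records = [(1, "Alice", 50000), (2, "Bob", 60000)]
conn.executemany(sql, records)

count = conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0]
assert count == 2
```

The DB-API doesn't expose an explicit prepare step, but the shape is the same as DBI's prepare/execute split: the mapping between structure and statement is computed once, not rebuilt for every row.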

Ouch. And this is exactly why SQL should die as the primary interface to the RDBMS. How the hell is my compiler going to help me find simple typing errors when the interface to use the RDBMS is built upon... strings! You are generating code from code; that's not integration, that's a hack! It's useful, as there is no other sound way to approach an RDBMS, but it's not a pretty sight.

We lose this with relational databases because the way the database organises data has no direct mapping to the way it might be set up in a standard programming language.

What this means is that every transaction to and from the database must go through a literally horrible re-mapping. I.e., the language data structures do not correspond to the RDBMS data structures, and vice versa.

In LedgerSMB, we solved this by putting a functional interface in the db. Then we dynamically map the objects and their properties into functions and arguments. Works great:-)

As an example - in PostgreSQL, the last I looked at writing a simple row into a table where there were something like 100 columns in the row...

You are either trolling or you need to fire the DB architect who designed that. There is *no way* that a 100-column table is good DB design. (Ok, mathematically, there is nothing that precludes it being good DB design, but I can't even imagine a scenario where this would be OK.)

In the 3rd generation programming languages this was just a simple structure with 100 entries.

Oh, you were the one who designed the 100-column table. Sorry..... Please go out and get some books on db design. You will thank me:-)

The data transfer from that structure generated a function call with more than 1000 parameters. This was to be mapped and re-mapped with each call to transfer data, even though the structure itself is static and determined at compile time.

IMO, your problem honestly is in the fact that you are using a monkey wrench as a ball peen hammer. It may sorta work but you are using the wrong tool for the job. If you want a simple object store use BDB or something like it. If you want a real data management solution, build your db *first.* If that is not your goal, use something other than an RDBMS.

The thing is, such a table could be set up very easily and populated with a simple loop that rolls in the required values via, say, a mapping function with about 10 arguments. This could be done ONCE at run time to prepare for the transfer of data, and then the same table could be referenced for each call, and simply an address could be sent with the transfer.

Noooo... It was dynamic, and the data was encoded as parameters on the stack. This means the stack must be built and torn down and rebuilt for each call.

How is this an issue with RDBMS's?

Next - the implementation was so bad that the program would run in test mode with only a few parameters, but it failed when the whole row was to be transferred.

Again, this is not a PostgreSQL problem;-)

I gave up on that interface.

From your description, that sounds like a wise choice.

While there is much good to say about RDBMSs in general, the issue I ran into was that the interface from 3rd generation languages took a HUGE step backward. IMHO we should have high level language statements called DBRead() and DBWrite(). In C these should generally correspond to fread() and fwrite(). If this is too complex, then DBWriteStruct() could be implemented with suitable mapping helper functions.

Again, this is an issue with the frameworks you are using. Personally, I tend to do the

Obviously, he's biased. But more importantly, he just said that column-store databases are going to take over the WAREHOUSE market. That doesn't mean that row-store databases are going to become obsolete, because there will always be applications out there that do a substantial amount of writing as well as reading.

In fact, the new wave of user-generated-content websites and webapps seems to me to indicate the exact opposite - if anything, row-store databases, with their usefulness in write-heavy applications, should be becoming more and more necessary/useful on the web.

So...chalk this one up to some grandstanding on the part of a guy who wants to put more money in his pockets...

"Column-oriented databases -- such as the one built by Stonebraker's latest start-up, Andover, Mass.-based Vertica Systems Inc. -- store data vertically in table columns rather than in successive rows. "

Relational databases will be around as long as humans generate relational data. Take the classic example of an invoice that may have many entries, each entry referencing an inventory item. This sort of thing is likely to exist forever, and RDBMSes model that pretty well.

As far as whether the backend is row- or column-oriented - who cares? As long as I can use the one most appropriate to my access pattern, the implementation details just don't interest me enough to get worked into a furor. Don't get me

Your invoice example is one that historically seems to only work with a purely relational database, but I beg to differ, because I've seen systems in place (custom coded) where an invoice front end was way more interactive in how you entered data. Searches were faster, and for large companies, the shoehorning to get data into place can be almost violent, because they have such a variety of what they want to enter, or search for, or assemble by, etc. We have one customer, a large contractor, who is trying a

In my IT business, a vast majority of our top tier clients (grossing over US$100 million annually) are still using antiquated software with a relational database backend. While these companies are generally VERY efficient in terms of providing services or products to their market, their accounting, purchase orders and project management software is decades outdated. Many of the companies that maintain these packages have merely made the interface more current (but still 5+ years old), while still using terribly outdated software underneath. I can't begin to tell you how often the words "FoxPro" and "MS SQL" come up, and it ends up being a relational database "solution" or even worse.

It is very frustrating, because we do have programmers on staff who create third party plug-ins to these databases to try to provide solutions that the OEM code doesn't. When you meet younger programmers, many of them are frustrated to be working on ancient solutions that have no hope of being upgraded, because these industries we work in are not in a rush to try anything new and shiny, but instead are happy with the status quo.

I just bid a job a few months back that would cost $150,000 to upgrade their database infrastructure, and likely save the company $300,000+ annually in added efficiency, less downtime, and a more robust report system. Guess what they said? "We all think it is fine the way it is." That's money thrown out the window, employees who are frustrated (without knowing why), and forcing the company to lose efficiency by not being able to compete with newer companies that are utilizing newer technology to better their bottom line.

I've been in the banking industry for the past 6 years, and every bank I've worked at has relied on text-only server side applications that we connect to via various terminal emulators. The workstations are all modern, but we don't use anything more taxing than Excel and an e-mail app.

Why have none of them changed beyond a few interface bolt-ons? Well . . . one of them actually did once . . . and it wasn't pretty. Sure it was graphical and point-and-click and more "user friendly" in appearance. But the

"I can't begin to tell you how often the words "FoxPro" and "MS SQL""

You do know those aren't remotely comparable, right? FoxPro scales to more users than Access (thanks to tables being separated into different files), but they're otherwise on a similar level in terms of what sort of jobs they're appropriate for. MS SQL Server is a full-fledged enterprise RDBMS. It may not scale quite as far as Oracle or DB2, but it gets closer every generation, and having worked mostly in Oracle for the last year or so, I've been m

Don't be so gentle when you turn it, and make sure the DB server is running. If you happen to do it just right, you can make enough hard drives in your RAID array hiccup to lose the whole array. Then it can read the entire DB in a millisecond!

Let me get this straight by paraphrasing: Column databases are the wave of the future, says a column database distributor on his new column database blog. And Red Hat would recommend you run your new column database on Red Hat Enterprise Linux, perhaps? I wonder what brand of kit Dell would recommend I run RHEL on...

Are they now officially an also-ran? Has the whole concept failed to be usefully implemented commercially, or will it be another Lisp--elegant, beautiful, and largely unused because it's kind of weird?

In a word, yes. I think there are a couple of reasons for this:

1. OR mappers like Hibernate have gotten to the point that they are quite good, so they make the value-add proposition of object databases less compelling.

2. Object databases are never going to match the speed of relational databases. This is the real dealbreaker. Suppose an object database can handle 95% of my queries with adequate performance. All well and good, but I'm totally screwed on those other 5%. On the other hand, if I was using a relational data

For data warehousing, a higher or different level of abstraction may be useful and make database design easier, particularly as parallelism becomes more and more common. Storing rich markup language or media in a database might be problematic as well.

But there's no way that RDBMS's are going away -- relational algebra simply solves too many data storage problems.

Once someone shows that there is no longer a use for any relationship between data entries, then we'll be able to say that RDBMSs are obsolete.

Actually, both headlines (/. and the linked article) are mistaken about what Michael Stonebraker is saying. He is mostly talking about read-intensive applications, and about optimizing how data is laid out for reading. This does not mean that RDBMSs are obsolete for all uses, just that he sees a faster way to retrieve data for certain uses.

SenSage built a column-oriented DB in 2001 and has had much success with the approach for their fast-input, fast-query, high-density, multi-TB databases. Stonebraker was on their technical advisory board. Interesting that he now centers his own startup on the same principles. See http://en.wikipedia.org/wiki/Column-oriented_DBMS [wikipedia.org].

I can understand people not reading every link on a slashdot article they comment on. But if you post the bloody link, is it too much asked to actually RTFA?! It's an article about a column. The actual column is quite interesting.

To add some content, this is about optimal storage for SQL databases in a data warehouse context where there are some interesting products that use something more optimal than the one size fits all solutions currently available from the big RDBMS vendors. The API on top is the same (i.e. SQL and other familiar dataware house APIs), which makes it quite easy to integrate.

Regarding the obsolescence question, one size fits all will be good enough for most for some time to come. Increasingly people are more than happy with lightweight options that are even less efficient on which they slap persistence layers that reduce performance even more just because it allows them to autogenerate all the code that deals with stuffing boring data in some storage. Not having to deal with that makes it irrelevant how the database works and allows you to focus on how you work with the data rather than worrying about tables, rows and ACID properties. Autogenerating code that interacts with the database allows you to do all sorts of interesting things in the generated code and the layers underneath. For example, the hibernate (a popular persistence layer for Java) people have been integrating Apache Lucene, a popular search index product, so that you can index and search your data objects using lucene search queries rather than sql. It's a quite neat solution that adds real value (e.g. fully text searchable product catalogs are dead easy with this).

Column based storage is just an optimization and not really that critical to the applications on top. If you need it, there are some specialized products currently. The author of the column is probably right about such solutions finding their way into mainstream products really soon. At the application level, you'll still be talking good old SQL to the damn thing though.
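
To make the storage distinction concrete, here's a toy sketch in Python (not any real engine's on-disk format; the table and values are made up) of the same data laid out row-wise and column-wise, and why a read-mostly aggregate favors the column layout:

```python
# Toy illustration: the same table laid out row-wise vs. column-wise,
# and an aggregate (SELECT SUM(amount)) that only needs one column.
rows = [
    {"id": 1, "name": "alice", "amount": 10.0},
    {"id": 2, "name": "bob", "amount": 20.0},
    {"id": 3, "name": "carol", "amount": 30.0},
]

# Column store: one contiguous sequence per column.
columns = {
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [10.0, 20.0, 30.0],
}

# The row store walks every record and plucks one field out of each;
# the column store scans a single contiguous array of just the data it needs.
total_rowwise = sum(r["amount"] for r in rows)
total_colwise = sum(columns["amount"])
assert total_rowwise == total_colwise == 60.0
```

Either way the SQL on top is identical; only the scan pattern underneath changes, which is exactly why this is an optimization rather than a new data model.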

Maybe his approach is all wrong. The database my company uses has MANY tables that are rarely written to, but a few that are written to ALL the time. Instead of trying to cram his 'one size fits all' database scheme down our throats and replace the current 'one size fits all' database scheme, maybe he should be trying to create a database engine that can do both.

I think you would have to determine the main use of the table beforehand (write-seldom or write-often), but the DB engine could use a different scheme for each table that way. I know some will claim that it can't be more efficient to split things this way, but remember that this guy is claiming 50x the speed for write-seldom operations.

As for Relational Databases... How is this exclusive to that? This is simply how the data is stored and accessed. If he is claiming 50x speed-up because he doesn't deal with the relational stuff, that's bunk. You could write a row-store database with much greater speed as well, given those parameters.

"In every major application area I can think of, it is possible to build a SQL DBMS engine with vertical market-specific internals that outperforms the 'one size fits all' engines by a factor of 50 or so," he wrote.

I know very little about DBMS systems, but I thought it has always been true that you can achieve monumental performance increases by building somewhat specialized database systems whose internals make assumptions about, and are tied to, the structure of the data being modelled. In fact, when RDBMS systems came in, one of the knocks on them was that they were far more resource-intensive than the hierarchical databases they displaced. However, the carved-in-stone assumptions of those models made them difficult and expensive to change or repurpose.

I'm sure I remember innumerable articles insisting that "relational databases don't need to be really all that much terribly slower if you know how to optimize this that and the other thing..."

In other words, as an outsider viewing from a distance, I've assumed that the increasing use of RDBMS was an indication that in the real world it turned out that it was better to be slow, flexible, and general, than fast, rigid, and specialized.

So, what is a "column store?" It sounds like it is an agile, rapid development methodology for generating fast, rigid, specialized databases?

Living in Kentucky, I can tell you a relational database would be handy if actually used. Why I've got in-laws that, well, I won't go into details. Let's just say I suspect my wife married an out of stater for fresh genes.

1. Object-oriented databases are designed to work well with object-oriented programming languages such as Python, Java, C#, Visual Basic.NET, C++ and Smalltalk. This makes implementation quick and easy, yet stable and scalable at the same time.

2. ODBMSs use exactly the same model as object-oriented programming languages.

3. It is also worth noting that object databases hold the record for the World's largest database (over 1000 Terabytes at Stanford Linear Accelerator Center).

4. Access to data can be faster because joins are often not needed (as they are in a tabular implementation of a relational database). This is because an object can be retrieved directly without a search, by following pointers (e.g. the objects are stored in trees for fast retrieval). Dynamic indexing schemes further speed up retrieval for full-text searches.
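
The "no joins needed" point can be sketched in a few lines of Python. This is a hypothetical illustration (the classes and data are made up, and no real ODBMS API is shown): related objects are reached by direct reference traversal rather than a foreign-key join.

```python
# Toy object model: in an object database the relationship IS the pointer,
# so a query that would be a join in SQL becomes attribute traversal.
class Customer:
    def __init__(self, name):
        self.name = name
        self.orders = []  # direct references to Order objects

class Order:
    def __init__(self, total, customer):
        self.total = total
        self.customer = customer      # back-pointer to the owning Customer
        customer.orders.append(self)  # maintain the forward reference

alice = Customer("alice")
Order(10.0, alice)
Order(25.0, alice)

# SQL: SELECT SUM(total) FROM orders WHERE customer = 'alice'
# Object DB: just follow the references already in hand.
assert sum(o.total for o in alice.orders) == 35.0
```

In a real ODBMS the pointer-following happens against persisted objects (faulted in from disk on access), but the programming model is exactly this traversal.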

I think RDBMSs will be around for some time -- but they will be relegated to more structured situations and smaller data sets. ODBMSs will take over where data is changing, persistence is critical, data types are mostly large binary objects with associated meta-data, and datasets are humongous.

Right now my favorite ODBMS is the ZODB (Zope Object Data Base) [wikipedia.org] - an ODBMS tightly integrated with both Python (implemented using Python's native 'pickle' object persistence functionality) and the Zope web application development system - which itself is built with and uses Python. You can learn more about Zope at Zope.org [zope.org].

The next geekSessions [geeksessions.com] is on this topic. If you're interested in hearing about alternatives to RDBMS in and using them in practice, take a look at the site. The event will be held on October 2nd in downtown SF and will also be available via webcast. In addition to the presentations, we'll have a Q&A session along with some food and FREE BEER.
Speakers at the event are:

Josh Berkus from the PostgreSQL core team

Paul Querna from Apache and Bloglines (wrote his own filesystem for Bloglines)

To be a relational database, a database must meet a very specific set of requirements. While the standard view of the database, from the DB administrator's and normal user's perspective, allows only limited ways to manipulate the data, looking at the actual storage structures and how to use the resultant files efficiently can yield some extremely efficient computational methods. For example, if one structures data in a relational database with few fields (perhaps as few as two) per representation, where each representation corresponds to a single file, then the data can be read directly from the file via flat-file techniques into an array or an efficient in-memory table, and you achieve the best of both worlds.
If the above example requires compressed data, then the relational data must be read from a single file via DB operations, with two different mappings overlaying the single database file. The programmer must understand the data orientation so that compression/decompression occurs correctly. Extremely fast reads/writes can be achieved this way.
The advantage of a column-oriented database is that the files are inherently optimized for data mining without the need to hire an expensive programmer. If the company has multiple requirements for the same data, then multiple databases may be required, unless they are willing to hire that expensive programmer. A key problem arises when multiple databases lose synchronization.
BTW, row-based databases optimized for storage size (footprint) compress data by column, not by row or record. Using these techniques, I have achieved far greater data compression than comparable Google-stored data.
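
Why compressing by column wins can be shown with a toy run-length encoder (a deliberately simplified sketch; real engines use dictionary, delta, and other schemes): a low-cardinality column compresses to almost nothing when its values sit together, but not when they're interleaved with other fields row by row.

```python
# Toy run-length encoding applied per column.
def rle(values):
    """Collapse runs of equal values into (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)  # extend the current run
        else:
            out.append((v, 1))             # start a new run
    return out

# A "status" column stored column-wise: long runs, great compression.
status_column = ["open"] * 4 + ["closed"] * 3
assert rle(status_column) == [("open", 4), ("closed", 3)]

# The same values interleaved row-wise with other fields: no runs survive,
# so per-row compression of this data gains nothing from the repetition.
row_stream = ["open", 1, "alice", "open", 2, "bob", "open", 3, "carol"]
assert all(count == 1 for _, count in rle(row_stream))
```

This is the sense in which even row-oriented engines compress "by column": the repetition worth exploiting lives within a column, not within a row.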

Michael Stonebraker is certainly a well-respected name, and he has been right on these issues in the past. Coincidentally, I'm testing my software with a new version of PostgreSQL as I type.
I think column vs. row storage can be considered simply an option. I can even see it being an option that you specify at the table level.
Most DBMS users really don't have much data. Today a 1,000,000-row table can be cached in RAM on even a low-end PC-based server. Once cached in RAM, row vs. column storage does not matter. I would imagine that 99% of the database tables in the world have far fewer than a million rows. This discussion applies only to the very few that are really large.

Will the imminent transition to SSDs make any difference? Row-based DBs mean you're typically reading large chunks (one row) sequentially, while column stores mean you're reading many small chunks (one per column) for every row. I'd think that if random-access time were almost zero, you'd get almost the same write performance, while read performance could be greatly improved because you only read the columns you need. It would certainly make DB design easier, too, if you didn't have to worry about putting very light information in the same table as heavy blobs.
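
The "only read the columns you need" idea can be sketched with a one-file-per-column layout (an assumed toy format, not how any particular engine stores data): summing one numeric column touches only that column's file, while a heavy blob column in its own file is never opened at all.

```python
import os
import struct
import tempfile

# Toy column file: the "amount" column serialized as little-endian doubles.
path = os.path.join(tempfile.gettempdir(), "amount.col")
amounts = [10.0, 20.0, 30.0]
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(amounts)}d", *amounts))

# SELECT SUM(amount): read back just this one file; any blob column
# living in its own file contributes zero I/O to this query.
with open(path, "rb") as f:
    data = f.read()
values = struct.unpack(f"<{len(data) // 8}d", data)
assert sum(values) == 60.0
```

On spinning disks the cost of this layout is extra seeks when reassembling whole rows; with near-zero random-access time on SSDs, that penalty largely disappears, which is the commenter's point.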

The main problem is that SQL is just a description language for set data, and a relational database is exactly that: set data. So far nobody has really brought out something more readable for dealing with sets in a mathematical sense; you could use mathematical operators, but then things would become even less readable than SQL is. All the approaches on the programming side I have seen (criteria objects, etc.) make things easier only in some domains; after that you revert to plain SQL and its derivatives.
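
The point above in miniature: the set operators SQL spells out as keywords map directly onto mathematical set operations. A toy sketch in Python (the table names are made up):

```python
# Two "tables" as plain sets of values.
employees = {"alice", "bob", "carol"}
managers = {"carol", "dave"}

# SQL keyword            mathematical operator
assert employees & managers == {"carol"}                          # INTERSECT
assert employees - managers == {"alice", "bob"}                   # EXCEPT
assert employees | managers == {"alice", "bob", "carol", "dave"}  # UNION
```

Readable enough for toy sets, but once predicates, projections and joins enter the picture, the bare operators get cryptic fast, which is the commenter's argument for why SQL's verbose keywords have stuck around.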