Google and the future of the RDBMS

I've been reading and re-reading this white-hot
article about Google's technical infrastructure by Rich
Skrenta. He makes a few key points, which I have felt free to
embellish and otherwise distort:

Google can afford to do -- and seems to be doing -- Bell
Labs-style "pure" CS R&D that expands upon, and in some
cases challenges, the 1970s technical state-of-the-art in storage,
OS and server architecture that has defined most large-scale
Web applications;

Google has staked its success on seemingly minor
incremental improvements, like highlighting your search terms
in the results summary, that nevertheless require
orders-of-magnitude increases in processing, memory or storage
power;

Google has constructed an impressive computing platform
to accomplish this, which is essentially one massive
computer;

The critical differentiator for this computing platform is its
ability to integrate hundreds of highly unreliable commodity
components and thus cheaply add CPU cycles.
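The reliability math behind that last point is worth making concrete. A minimal sketch, with availability figures invented for illustration (these are not Google's actual numbers):

```python
# Back-of-the-envelope sketch: individually unreliable machines, once
# replicated, can beat a single expensive "reliable" server.
def shard_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of a shard's replicas is up,
    assuming independent failures."""
    return 1 - (1 - node_availability) ** replicas

# A flaky commodity box that is up only 95% of the time...
print(shard_availability(0.95, 1))            # → 0.95
# ...triplicated, beats a pricey server with 99.9% uptime:
print(round(shard_availability(0.95, 3), 6))  # → 0.999875
```

This is the sense in which unreliable parts become a cheap, reliable whole: redundancy in software substitutes for quality in hardware.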

"In a previous job I specified 40 moderately-priced servers to
run a new internet search site we were developing. The ops
team overrode me; they wanted 6 more expensive servers, since
they said it would be easier to manage 6 machines than 40.

"What this does is raise the cost of a CPU second. We had
engineers that could imagine algorithms that would give
marginally better search results, but if the algorithm was 10
times slower than the current code, ops would have to add 10X
the number of machines to the datacenter. If you've already got
$20 million invested in a modest collection of Suns, going 10X to
run some fancier code is not an option.

"Google has 100,000 servers."
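Skrenta's point about the cost of a CPU second can be sketched with back-of-the-envelope arithmetic. The server counts and prices below are hypothetical, not from his article:

```python
# Illustrative arithmetic only; the budgets and machine counts are invented.
# The point: commodity boxes drive down the cost of a CPU, so an algorithm
# that is 10x slower but slightly better becomes affordable.
def cost_per_cpu(total_budget: float, servers: int, cpus_per_server: int) -> float:
    return total_budget / (servers * cpus_per_server)

# Hypothetical: $20M spent on 500 big SMP boxes with 8 CPUs each...
print(cost_per_cpu(20_000_000, 500, 8))     # → 5000.0 dollars per CPU
# ...versus the same budget on 20,000 cheap 2-CPU whiteboxes:
print(cost_per_cpu(20_000_000, 20_000, 2))  # → 500.0 dollars per CPU
```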

Google's architecture of many cheap computer boxes
running free operating systems seems to be both an extension
of trends that were percolating prior to Google (recall the old
Slashdot saw "imagine a Beowulf cluster of these ..." and, if you
were at the Windows 2000 launch, recall the conference hall full
of Dells we were told would soon be powering Hotmail) and at
the same time a very visible and clear advantage for the
company over its competitors. I'm sure there are system
architects looking at what Google has done and thinking of trying
something similar (as opposed to buying another Sun or HP
box).

My question is: Can the RDBMS as we know it today scale
adequately for Web applications distributed across
hundreds or thousands of individual CPUs? You write in the
Internet Application Workbook:

"It turns out that the CPU-CPU bandwidth
available on typical high-end servers circa 2002 is 100
Gbits/second, which is 100 times faster than the fastest
available Gigabit Ethernet, FireWire, and other inexpensive
machine-to-machine interconnection technologies.

"Bottom line: if you need more than 1 CPU to run the RDBMS,
it usually makes most sense to buy all the CPUs in one physical
box.
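The quoted bandwidth gap is easier to feel with a worked example. A rough sketch (the 100 GB working set is an arbitrary figure chosen for illustration):

```python
# Rough illustration of the quoted 100x ratio: time to move a working set
# between CPUs over a server backplane vs. over Gigabit Ethernet.
def transfer_seconds(gigabytes: float, gbits_per_sec: float) -> float:
    return gigabytes * 8 / gbits_per_sec  # 8 bits per byte

print(transfer_seconds(100, 100))  # backplane at 100 Gbits/s → 8.0 seconds
print(transfer_seconds(100, 1))    # Gigabit Ethernet       → 800.0 seconds
```

Eight seconds versus thirteen minutes for the same shuffle is why "all the CPUs in one physical box" wins for a chatty workload like an RDBMS.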

If people start building more Google-style backends, will they
be able to use traditional RDBMSs or will they, like Google, have
to start rethinking the viability of this 1970s technology?

Or, to invert the question, will Google be able to use its
platform as it transitions from infrequently updated datasets
(Web index, Usenet archive) and occasionally-updated datasets
(news aggregation, shopping bot) to datasets that are both
complex and mutating in real-time (email,
social networking systems, weblogs)? If so, do you
have any clue how Google might create an RDBMS-style system
for its platform?

Answers

A profound question. The very first system to look at links among Web sites was developed by Ellen Spertus, then a graduate student at MIT but doing her research at the University of Washington because nobody at MIT understood what an RDBMS was. She made it possible to use SQL to ask questions such as "show me all the sites that link to http://www.photo.net" or "show me all the sites that are linked to by at least 10 other sites". So in some sense Google has a heritage in an RDBMS-based system.
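Spertus-style link queries are easy to sketch in SQL. Here is a toy version using SQLite, with a table and URLs invented for illustration (her actual system's schema surely differed):

```python
# Toy version of the link queries described above, using an in-memory SQLite
# database. Schema and sample URLs are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (from_url TEXT, to_url TEXT)")
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("http://a.example/", "http://www.photo.net"),
    ("http://b.example/", "http://www.photo.net"),
    ("http://a.example/", "http://c.example/"),
])

# "show me all the sites that link to http://www.photo.net"
rows = conn.execute(
    "SELECT from_url FROM link WHERE to_url = ? ORDER BY from_url",
    ("http://www.photo.net",)).fetchall()
print([r[0] for r in rows])  # → ['http://a.example/', 'http://b.example/']

# "show me all the sites that are linked to by at least N other sites" (N=2)
rows = conn.execute("""
    SELECT to_url, COUNT(DISTINCT from_url)
    FROM link GROUP BY to_url
    HAVING COUNT(DISTINCT from_url) >= 2
""").fetchall()
print(rows)  # → [('http://www.photo.net', 2)]
```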

The RDBMS was developed for precious data. Google and similar massive Internet systems deal in non-precious data. Google need not care if they lose thousands of updates from an evening's crawling or if some of their servers are several updates behind and give slightly different results than their most up-to-date server. So they're free to do a lot of stuff that people building a bank transaction processing system aren't free to do.

The RDBMS is also all about making it easy and reasonably efficient to ask new and unanticipated questions. Most Web applications have a very constrained interface and therefore a very limited number of questions that can be asked. So again it would be very wasteful to use an RDBMS in a performance-critical server farm such as Google's.

The RDBMS is all about making sure that average quality programmers on tight schedules don't make terrible mistakes in managing concurrency. An organization with brilliant programmers and longer development schedules might be able to manage concurrent updates at a much lower cost in performance and hardware.
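What the RDBMS buys the average-quality programmer can be seen in miniature: a transfer between accounts either commits whole or rolls back whole, with no concurrency bookkeeping in application code. A sketch using SQLite (schema and amounts invented for illustration):

```python
# The database, not the programmer, guarantees all-or-nothing updates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])

def transfer(conn, src, dst, amount):
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        if conn.execute("SELECT balance FROM account WHERE id = ?",
                        (src,)).fetchone()[0] < 0:
            raise ValueError("insufficient funds")  # triggers rollback
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(conn, "alice", "bob", 60)
try:
    transfer(conn, "alice", "bob", 60)  # would overdraw; rolled back
except ValueError:
    pass
print(dict(conn.execute("SELECT id, balance FROM account")))
# → {'alice': 40, 'bob': 60}
```

Brilliant programmers could hand-roll this discipline more cheaply in CPU terms; the RDBMS makes it the default for everyone else.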

It might be a mistake to look at the most challenging IT problems as generic examples. If you said "I'm not going to build an accounting system unless it can solve all the problems faced by General Motors" you'd never build QuickBooks. It would also be a mistake for most people to say "My computation problems are tough so I want to get the same setup as those IBM genetic researchers or Google."