Tuesday, June 27, 2006

Four petabytes in memory

One way to look at how the Google cluster changes the game is to look at how data access speeds change with a cluster of this size.

Google reportedly had an estimated 450k machines two months ago and adds machines at roughly 100k per quarter. In 2004, each of these machines had 2-4G of memory; two years later, they are likely up to 8G standard.

That means that Google can store roughly 500k * 8G = 4 petabytes of data in memory on their cluster.
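A quick back-of-the-envelope sketch of that arithmetic (the machine count and per-machine RAM are the rough estimates above, not exact figures):

```python
# Rough estimate of aggregate cluster memory, using the numbers from the post.
machines = 500_000        # ~450k two months ago, growing ~100k/quarter
ram_per_machine_gb = 8    # assumed standard per machine in 2006

total_gb = machines * ram_per_machine_gb
total_pb = total_gb / 1_000_000   # using decimal units: 1 PB = 1,000,000 GB

print(total_pb)  # 4.0 petabytes
```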

Four petabytes. How much is that?

It is twice the size of the entire Internet Archive, a historical copy of much of the Web. It is the same as the estimated size of all the data Google had on disk in 2003.

It is a staggering amount of data. And it is all accessible at speeds orders of magnitude faster than reading it from disk.

8 comments:

RichB said...
I believe Microsoft/MSN/Live were talking a couple of months ago about why they had a competitive advantage by starting later and having a completely 64-bit setup before anyone else. How could Google have 8 gigs of RAM if they didn't have 64-bit? Was Microsoft wrong?

Being able to access vast amounts of data orders of magnitude faster allows doing things that couldn't be done before. Analyses that were impossible before because they would take years can now be done in days. Online features that were impossible before because they would have taken minutes to respond can now be launched.

On your other point, I don't think this is in opposition to disk-oriented tools. Being able to store vast amounts of data in memory means you can operate much more quickly over frequently or repeatedly accessed data, and over data for online features with short access time requirements. Infrequently accessed data is still going to be on disk.

Actually, while GFS manages disks, it also uses any available server memory as a large disk cache - well, it is really Linux doing that, but net-net that's it. With locality of reference, they probably get a major benefit.

As for MS saying they have a late adopter advantage - isn't it odd that a software company would hang their hat on a hardware advantage? GFS is a major SW win for Google, and not easily or quickly replicated. See more on GFS at http://storagemojo.com/?p=88

In fact, it does not even make sense, because a 32-bit mainboard can't in any case take more than 4GB of RAM. That means that to add more RAM, they would have to swap out the mainboard and CPU too. And that certainly does not make sense, because then you get a new system anyway.
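(The 4GB ceiling the comment refers to is just the size of a flat 32-bit address space, setting aside extensions like PAE; a quick sketch:)

```python
# A 32-bit address can name 2**32 distinct bytes of memory.
addressable_bytes = 2 ** 32
addressable_gib = addressable_bytes / (1024 ** 3)  # binary units: 1 GiB = 2**30 bytes

print(addressable_gib)  # 4.0 GiB
```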

So I'm thinking this is where the 100k new machines per quarter come from: upgrading/replacing old machines. And even if not, each new machine brings an additional 4GB that would have been impossible to add to the old ones.