Sunday, March 14, 2010

GFS and its evolution

A fascinating article, "GFS: Evolution on Fast-Forward", in the latest CACM magazine interviews Googler Sean Quinlan and exposes the problems Google has had with the legendary Google File System as the company has grown.

Some key excerpts:

The decision to build the original GFS around [a] single master really helped get something out the door ... more rapidly .... [But] problems started to occur ... going from a few hundred terabytes up to petabytes, and then up to tens of petabytes ... [because of] the amount of metadata the master had to maintain.

[Also] when you have thousands of clients all talking to the same master at the same time ... the average client isn't able to command all that many operations per second. There are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests and the master would be under a fair amount of duress.

64MB [was] the standard chunk size ... As the application mix changed over time, however, ways had to be found to let the system deal efficiently with large numbers of files [of] far less than 64MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all those [small] files made on the centralized master .... There are only a finite number of files you can accommodate before the master runs out of memory.
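To make the small-file problem concrete, here is a rough back-of-the-envelope sketch. The per-file and per-chunk metadata sizes (64 bytes each) are assumptions chosen for illustration, not figures from the interview, and the function name is hypothetical. The point is only the ratio: the same petabyte of data costs the master about a thousand times more memory when it is stored as 64KB files instead of 64MB files.

```python
# Rough estimate of how much memory a single master needs for metadata.
# ASSUMPTIONS (not from the interview): about 64 bytes of master state
# per chunk and about 64 bytes per file. Real numbers will differ.

CHUNK_SIZE = 64 * 2**20        # 64MB standard chunk size (from the article)
BYTES_PER_CHUNK_META = 64      # assumed
BYTES_PER_FILE_META = 64       # assumed


def master_memory_bytes(total_data_bytes, avg_file_bytes):
    """Estimate master metadata memory for a given data volume and file size."""
    num_files = total_data_bytes // avg_file_bytes
    # Every file occupies at least one chunk, however small it is.
    chunks_per_file = max(1, -(-avg_file_bytes // CHUNK_SIZE))  # ceiling division
    num_chunks = num_files * chunks_per_file
    return num_files * BYTES_PER_FILE_META + num_chunks * BYTES_PER_CHUNK_META


PETABYTE = 2**50
large = master_memory_bytes(PETABYTE, 64 * 2**20)   # 64MB files
small = master_memory_bytes(PETABYTE, 64 * 2**10)   # 64KB files

print(f"64MB files: {large / 2**30:.0f} GiB of master memory")
print(f"64KB files: {small / 2**30:.0f} GiB of master memory")
```

Under these assumptions, the metadata fits comfortably in one machine's memory for large files and does not come close for small ones, which matches the point Quinlan is making about the file count rather than the data volume being the limit.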

Many times, the most natural design for some application just wouldn't fit into GFS -- even though at first glance you would think the file count would be perfectly acceptable, it would turn out to be a problem .... BigTable ... [is] one potential remedy ... [but] I'd say that the people who have been using BigTable purely to deal with the file-count problem probably haven't been terribly happy.

The GFS design model from the get-go was all about achieving throughput, not about the latency at which it might be achieved .... Generally speaking, a hiccup on the order of a minute over the course of an hour-long batch job doesn't really show up. If you are working on Gmail, however, and you're trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up. We had similar issues with our master failover. Initially, GFS had no provision for automatic master failover. It was a manual process ... Our initial [automated] master-failover implementation required on the order of minutes .... Trying to build an interactive database on top of a file system designed from the start to support more batch-oriented operations has certainly proved to be a pain point.

They basically try to hide that latency since they know the system underneath isn't really all that great. The guys who built Gmail went to a multihomed model, so if one instance of your Gmail account got stuck, you would basically just get moved to another data center ... That capability was needed ... [both] to ensure availability ... [and] to hide the GFS [latency] problems.

The model in GFS is that the client just continues to push the write until it succeeds. If the client ends up crashing in the middle of an operation, things are left in a bit of an indeterminate state ... RecordAppend does not offer any replay protection either. You could end up getting the data multiple times in a file. There were even situations where you could get the data in a different order ... [and then] discover the records in different orders depending on which chunks you happened to be reading ... At the time, it must have seemed like a good idea, but in retrospect I think the consensus is that it proved to be more painful than it was worth. It just doesn't meet the expectations people have of a file system, so they end up getting surprised. Then they had to figure out work-arounds.
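The duplicate-record behavior described above, and one common kind of workaround, can be sketched in a few lines. This is a toy model, not the GFS API: FlakyLog, append_until_success, and read_unique are hypothetical names, and the lost-acknowledgment failure is simulated. The writer retries until an append is acknowledged, so a record can land in the file more than once; the reader then drops repeats using a unique ID and a checksum embedded in each record. It does not address the reordering problem Quinlan also mentions.

```python
import uuid
import zlib


class FlakyLog:
    """Toy stand-in for an append-only file. NOT the real GFS interface."""

    def __init__(self, fail_acks):
        self.records = []
        self._fail_acks = set(fail_acks)  # attempt numbers whose ack is lost
        self._attempt = 0

    def append(self, raw):
        self._attempt += 1
        self.records.append(raw)            # the data is written...
        if self._attempt in self._fail_acks:
            raise TimeoutError("ack lost")  # ...but the client never hears back


def encode(record_id, data):
    """Prefix the record with a CRC32 so readers can reject corrupt copies."""
    body = record_id.encode() + b"|" + data
    return zlib.crc32(body).to_bytes(4, "big") + body


def decode(raw):
    """Return (record_id, data), or None if the checksum does not match."""
    checksum, body = raw[:4], raw[4:]
    if zlib.crc32(body).to_bytes(4, "big") != checksum:
        return None
    record_id, data = body.split(b"|", 1)
    return record_id.decode(), data


def append_until_success(log, data):
    """Keep pushing the write until it is acknowledged (at-least-once)."""
    raw = encode(uuid.uuid4().hex, data)  # the same ID is reused on every retry
    while True:
        try:
            log.append(raw)
            return
        except TimeoutError:
            continue


def read_unique(log):
    """Reader-side workaround: skip corrupt records and repeated IDs."""
    seen, out = set(), []
    for raw in log.records:
        decoded = decode(raw)
        if decoded is None or decoded[0] in seen:
            continue
        seen.add(decoded[0])
        out.append(decoded[1])
    return out


log = FlakyLog(fail_acks={1, 2})  # the first two acknowledgments are lost
append_until_success(log, b"user archived message 17")
append_until_success(log, b"user starred message 42")

print(len(log.records))   # physical records in the file, duplicates included
print(read_unique(log))   # logical records after de-duplication
```

Every application that reads such a file has to carry this kind of filtering logic, which is exactly the sort of work-around burden the quote complains about.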

It is interesting to see the warts of Google File System and Bigtable exposed. I remember being surprised, when reading the Bigtable paper, that it was layered on top of GFS. Those early decisions to use a file system designed for logs and batch processing of logs as the foundation for Google's interactive databases appear to have caused a lot of pain and workarounds over the years.

On a related topic, a recent paper out of Google, "Using a Market Economy to Provision Compute Resources Across Planet-wide Clusters" (PDF), looks at another problem Google is having: prioritizing all the MapReduce batch jobs at Google while trying to maximize utilization of its clusters. The paper only describes a test of one promising solution, auctioning off cluster time to give developers an incentive to move their jobs to non-peak times and idle compute resources, but it is still an interesting read.

Update: A year later, rumor has it that a new version of GFS (called Colossus, aka GFS2) resolves the problems with Bigtable I described here. Quoting: "[Bigtable] is [now] underpinned by Colossus, the distributed storage platform also known as GFS2. The original Google File System ... didn't scale as well as the company would like. Colossus is specifically designed for BigTable, and for this reason it's not as suited to 'general use' as GFS was."