Thursday, February 02, 2006

Early Amazon: Splitting the website

Now that commodity Linux servers are so commonplace, it is easy to forget that, during the mid-to-late 1990s, there was a lively debate about whether sites should scale up using massive, mainframe-like servers (Sun's Starfire being a good example) or scale out using cheap hardware (e.g. Intel desktop PCs running Linux).

When I joined Amazon in early 1997, there was one massive database and one massive webserver. Big iron.

As Amazon grew, this situation became less and less desirable. Not only was it costly to scale up big iron, but we didn't like having a beefy box as a single point of failure. Amazon needed to move to a web cluster.

Just a few months after I started at Amazon, I was part of a team of two responsible for the website split. I was working with an extremely talented developer named Bob.

Bob was a madman. There was nothing that seemed to stop him. Problem on the website no one else could debug? He'd attach to a live obidos process and figure it out in seconds. Database inexplicably hanging? He'd start running process traces on the live Oracle database and uncover bugs in Oracle's code. Crazy. He seemed to know no limits, nothing that he wouldn't attack and figure out. Working with him was inspirational.

Bob and I were going to take Amazon from one webserver to multiple webservers. This was more difficult than it sounds. There were dependencies in the tools and in obidos that assumed one webserver. Some systems even directly accessed the data stored on the webserver. In fact, there were so many dependencies in the code base that, just to get this done in any reasonable amount of time, it was necessary to maintain backward compatibility as much as possible.

We designed a rough architecture for the system. There would be two staging servers, development and master, and then a fleet of online webservers. The staging servers were largely designed for backward compatibility. Developers would share data with development when creating new website features. Customer service, QA, and tools would share data with master. This had the added advantage of making master a last wall of defense where new code and data would be tested before it hit online.

Read-only data would be pushed out through this pipeline. Logs would be pulled off the online servers. For backward compatibility with log processing tools, logs would be merged so they looked like they came from one webserver and then put on a fileserver.

Stepping out for a second, this is a point where we really would have liked to have a robust, clustered, replicated, distributed file system. That would have been perfect for read-only data used by the webservers.

NFS isn't even close to this. It isn't clustered or replicated. It freezes all clients when the NFS server goes down. Ugly. Options that are closer to what we wanted, like CODA, were (and still are) in the research stage.

Without a reliable distributed file system, we were down to manually giving each webserver a local copy of the read-only data. Again, existing tools failed us. We wanted a system that was extremely fast and would do versioning and rollback. Tools like rdist were not sufficient.

So we wrote it ourselves. Under enormous time pressure. The current big iron was melting under "get big fast" load, a situation that was about to get much worse as Christmas approached. We needed this done and done yesterday.

We got it done in time. We got Amazon running on a fleet of four webservers. It all worked. Tools continued to function. Developers even got personal websites, courtesy of Bob, that made testing and debugging much easier than before.

Well, it sort of just worked. I am embarrassed to admit that some parts of the system started to break down more quickly than I had hoped. My poor sysadmin knowledge bit me badly as the push and pull tools I wrote failed in ways a more seasoned geek would have caught. Worse, the system was not nearly robust enough against unexpected outages in the network and machines, partially because we were not able to integrate the system with the load balancer (an early version of Cisco LocalDirector) that we were using to take machines automatically in and out of service.

But it did work. Amazon never would have been able to handle 1997 holiday traffic without what we did. I am proud to have been a part of it.

Of course, all of this is obsolete. Amazon switched to Linux (as did others), substantially increasing the number of webservers, and eventually Amazon switched to a deep services-based architecture. Needs changed, and the current Amazon web cluster bears little resemblance to what I just described.

But, when building Findory's webserver cluster, I again found myself wanting a reliable, clustered, distributed file system, and again found the options lacking. I again wanted tools for fast replication of files with versioning and rollback, and again found those missing. As I looked to solve those problems, the feeling of deja vu was unshakeable.

Greg, I was wondering how you've solved the clustered file system and grid computing issues at Findory from a 10,000 foot view. I have a number of apps I'd like to write that need such features. Unless you work at Google we all share the same pain =). Nutch on the Java side of things looks interesting but it needs more support.

Findory uses an architecture roughly similar to what I described for Amazon in 1997 but quite a bit simpler.

As for other options, I think AFS might work if you don't need RW replication. The new MySQL cluster product looks really good. It might work if you really need a database, not a file system, and you have the budget for a larger number of machines (because all data is kept in memory). I think CODA might be an option if you're willing to use it on a production system (which they recommend against).

As you said, it's not a file system. From their site, "you don't run regular Unix applications or databases against MogileFS. It's meant for archiving write-once files and doing only sequential reads."

But, if those constraints are fine for your needs, it might be a good option.

As the person who had to maintain and update the website management code after Greg moved on to more fun tasks, I can personally attest to the limits of Greg's sysadmin knowledge.

Seriously, Amazon's web site operations were so "high-end" that there wasn't anything they could buy (no COTS solutions) that would handle the situations Amazon was encountering. From the CMS to the web site management tools to the inventory and customer databases, it was all (at least originally) written in-house out of necessity. Sure, it ran on top of hardware, OSes, and database systems built by others, but that was about it.

I'll bet this is also true of places like eBay or any other large ecommerce stores.

This is also very true of "the world's largest financial exchange" (I'll let you google to find out who). They are at the limits of Oracle running on the biggest Sun box you can buy. And since it's financial you absolutely require minimal caching and ACID properties. The fact it's an exchange also means it's extremely difficult to scale out the db - the user account balance table is checked at least once for every single write operation (of which there are 300/second).

Lustre is a high-performance, distributed, POSIX-compliant file system (as opposed to a mere cluster FS like Red Hat GFS). It is open source, with support and premium packages from Cluster FS. In short, it provides a distributed RAID 0.

It supports some levels of redundancy at the server level. In early 2007, their roadmap is to support a distributed RAID 5; check out the roadmap on the Cluster FS site. It's the closest thing I've found to the Google File System, and it's more general purpose. Has anyone else had experience with it?

We spent quite a bit of time looking for a distributed file system for our large web application. I liked MogileFS, but ran into problems maintaining backward compatibility (since it's not a mountable file system) and scaling problems when we tried to put more than 500 million objects in it.

We just signed a contract with Isilon Systems for a 150 TB usable distributed filesystem. We were able to haggle them down to a reasonable price (about 20% more than what I estimated the MogileFS hardware would cost).

I'm excited about Isilon. They've got a great product. Their support contract is incredibly expensive, and that's the part we grilled them on to reduce cost.

I think Akamai used NetApp to solve these kinds of problems, but even that is unnecessary now that free open source solutions are available. ZFS on the free and open source OpenSolaris operating system now integrates with Lustre 3.0, so you can use Lustre as your distributed file system with ZFS as its back end (for block-level checksumming and more reliability). That way, you don't need the insane amount of duplication of data that Google has in the Google File System or the insane NetApp bill that Akamai pays. It looks like Sun has solved everybody's problems and open sourced the code, but perhaps they are too late?