Search This Blog

Loading...

Tuesday, March 12, 2013

How We Went from 30 Servers to 2: Go

When we built the first version of IronWorker, about 3 years ago, it was written in Ruby and the API was built on Rails. It didn’t take long for us to start getting some pretty heavy load and we quickly reached the limits of our Ruby setup. Long story short, we switched to Go. For the long story, keep reading, here's how things went down.

The Original Setup

First, a little background: we built the first version of IronWorker, originally called SimpleWorker (clever name right?), with Ruby. We were a consulting company building apps for other companies and there were two really hot things at that time: Amazon Web Services and Ruby on Rails. So we built apps using Ruby on Rails on AWS and we got some great clients. The reason we built IronWorker was to scratch our own itch. We had several clients building hardware devices that constantly sent in data, 24/7, and we had to collect it and process it into something useful. Typically scheduling big jobs to run through the data every hour, then every day and so on. We decided to build something we could use for all our clients without having to manage a bunch of infrastructure for each of them to process this data. So we built "workers as a service" that we used internally for a while, then we thought there must be other people that need this so we decided to make it public and thus IronWorker was born.

Our sustained CPU usage on our servers was approximately 50-60%. When it increased a bit, we’d add more servers to keep it around 50%. This was fine as long as we didn’t mind paying for a lot of servers (we did mind). The bigger problem was dealing with big traffic spikes. When a big spike in traffic came in, it created a domino effect that would take take down our entire cluster. At some threshold above 50%, our Rails servers would spike up to 100% CPU usage and become unresponsive. This would in turn cause the load balancer to think it failed and take it out of the pool, thereby applying the load that the unresponsive server would have been handling to the remaining servers. And since the remaining servers are now handling the load of the lost server plus the spike, inevitably a second server would go down, the load balancer would take it out of the pool and so on. Pretty soon every server in the pool is toast. This phenomenon is otherwise known as a colossal clusterf**k (ref: +Blake Mizerany).

Here's a simple drawing of the domino failure effect.

The only way to avoid this with the setup we had was to have a ton of extra capacity to keep our server utilization much lower than what it was, but that meant spending a ton of money. Something had to change.

We Rewrote It

We decided to rewrite the API. This was an easy decision, clearly our Ruby on Rails API wasn't going to scale well and coming from many years of Java development and having written a bunch of things that handled tons of load with way less resources than what this Ruby on Rails setup could handle, I knew we could do a lot better. So then the decision came down to which language to use.

Choosing a Language

I was open to new ideas since the last thing I wanted to do was go back to Java. Java is (was?) a great language in a lot of ways such as performance, but after writing Ruby code for a few years, I was hooked on how productive I could be. Ruby is fun, plain and simple.

We looked at other scripting languages with better performance than Ruby (which wasn't hard) like Python and JavaScript/Node, we looked at Java derivatives like Scala and Clojure, and other languages like Erlang (which apparently AWS uses) and Go (golang). Go won. The fact that concurrency was such a fundamental part of the language was huge; the standard core library had almost everything we needed to build an API service; it's terse; it compiles fast; like Ruby, Go is fun; and finally, the numbers don't lie. After some prototyping and performance testing, we knew we could push some serious load through it. With some convincing of the team ("It's all good, Google is backing it"), we bit the bullet.

When we first decided on Go, it was a risky decision. There wasn't a big community, there wasn't a lot of open source projects, there weren't many (if any) success stories of production usage. We also weren't sure if we would be able hire top talent if we chose Go, but we soon found out that we could get top talent because we chose Go. We were one of the first companies to publicly say that we were using it in production and the first company to post a Go job to the golang mailing list. Top tier developers wanted to work for us just so they could use Go in their day to day lives.

After Go

After we rolled out our Go version, we reduced our server count to two and we really only had two for redundancy. They were barely utilized, it was almost as if nothing was running on them. Our CPU utilization was less than 5% and the entire process started up with only a few hundred KB's of memory (on startup) vs our Rails apps which were ~50MB (on startup). Compare that even to JVM memory usage! It was night and day. And we've never had another colossal clusterf**k since.

We have grown a lot since those early days. We now get way more traffic, we added two more services (IronMQ and IronCache), and we run 100's of servers to handle our customer's needs. And it's all powered by Go on the backend. In retrospect, it was a great decision to choose Go as it's allowed us to build great products, to grow and scale, and attract grade A talent. And I believe it will continue to help us grow for the foreseeable future.

UPDATE: We posted another article on our experiences with Go called Go After 2 Years in Production. In it, we address issues of performance, memory, concurrency, reliability, deployment, and talent acquisition. As with above, this article also generated a large thread of comments at Hacker News.