Embed this text on your site

Share this page

The story of how the cloud-king turned its back on Rails
and swindled its customers

A Rails dyno isn't what it used to be. In mid-2010, Heroku quietly redesigned its routing system, and the change — nowhere documented, nowhere instrumented — radically degraded throughput on the platform. Dollar for dollar a dyno became worth a fraction of its former self.

Of course for the tech side of the house, this means we’re finally running into those problems that everyone says they want to run into. We are finally hitting that point where our optimizations aren’t premature. With nearly 15 million monthly uniques we are, as they say, “at scale.”

As exciting as that is it’s also frightening, and in this moment more than others it’s great to have a strong technical partner — like Heroku, the hosting service that makes ops as easy as playing with sliders:

If you had asked us a couple of weeks ago, we would have told you that we were happy to be one of Heroku’s largest customers, happy even to be paying their eye-popping monthly bill (~$20,000). “As devs,” we would have said, “we don’t want to manage infrastructure, we want to build features. If Heroku lets us do that, they’ve earned their keep.”

Ten days ago, spurred by a minor problem serving our compiled javascript, we started running a lot of ab benchmarks. We noticed that the numbers we were getting were consistently worse than the numbers reported to us by Heroku and their analytics partner New Relic. For a static copyright page, for instance, Heroku reported an average response time of 40ms; our tools said 6330ms. What could account for such a big difference?

“Requests are waiting in a queue at the dyno level,” a Heroku engineer told us, “then being served quickly (thus the Rails logs appear fast), but the overall time is slower because of the wait in the queue.”

How you probably think Heroku works

To understand how weird this response was, you first must understand how everyone thinks Heroku works.

When you deploy an app to Heroku, you actually deploy it to a bunch of different “dynos” (virtualized Ubuntu servers) that live on AWS. For a Rails app, each dyno is capable of serving one request at a time. They each cost $36 per month, or $79.20 per month if you buy the New Relic add-on.

When someone requests a page from your site, that request first goes through Heroku’s router (they call it the “routing mesh”), which decides which dyno should work on it. The ostensible purpose of the router is to balance load intelligently between dynos, so that a single dyno doesn’t end up working non-stop while the others do nothing. If at any given moment all the dynos are busy, the router should queue the request and give it to the first one that becomes available.

Intelligent routing: The routing mesh tracks the availability of each dyno and balances load accordingly. Requests are routed to a dyno only once it becomes available. If a dyno is tied up due to a long-running request, the request is routed to another dyno instead of piling up on the unavailable dyno’s backlog.

The heroku.com stack only supports single threaded requests. Even if your application were to fork and support handling multiple requests at once, the routing mesh will never serve more than a single request to a dyno at a time.

The Heroku log format doesn't even include an entry for time spent in the in-dyno queue, because the assumption is that such a queue does not exist. The entries that are included are for the router queue:

"Queuing at the Dyno Level"

This is why the Heroku engineer's comment about requests “waiting in a queue at the dyno level” struck us as so bizarre — we were under the impression that this could never happen. The whole point of "intelligent load distribution as you scale" is that you shouldn't send requests to dynos unless they're free! And even if all the dynos are full, it's better for the router to hold on to requests until one frees up (rather than risk stacking them behind slow requests).

If you're lucky enough to find the correct doc — a doc that contradicts all the others, and the logs, and the marketing material — you'll find that Heroku replaced its "intelligent load distribution," once a cornerstone of its platform, with "random load distribution":

In mid-2010, Heroku redesigned its routing mesh so that new requests would be routed, not to the first available dyno, but randomly, regardless of whether a request was in progress at the destination.

That decision was not announced. The bulk of Heroku's documentation explicitly says, or implicitly assumes, the opposite. “Time spent in the dyno queue” is nowhere reported in their logs, and nowhere exposed by their (very expensive) analytics partner New Relic. And, crucially, this change didn't affect their prices — Heroku has charged $36 per month per dyno since launch.

So what?

It would be like if those machines at the Whole Foods checkout line didn’t send you to the first available register, but to a random register where other customers were already standing in line. How much longer would it take to get out of the store? How much more time would the checkout clerks spend idling? If you owned that store and one day the manager, without telling you, replaced your fancy checkout routing system with a pair of dice, and his nightly reports to you never changed — he never told you how long people were waiting at individual registers, that they even could (wasn’t preventing that the whole point of having a routing system?) — that would be bad, right?

In the old regime, which Heroku called “intelligent routing,” a dyno was a dyno was a dyno. When you bought one, you bought a predictable increase in concurrency (the capacity for simultaneous requests). In fact Heroku defines concurrency as "exactly equal to the number of dynos you have running for your app."

But that's no longer true, because the routing system is no longer intelligent. When you route requests randomly — we’ll call this the “naive” approach — concurrency can be significantly less than the number of dynos. That’s because unused dynos only have some probability of seeing a request, and that probability decreases as the number of dynos grows. It’s no longer possible to reliably “soak up” excess load with fresh dynos, because you have no guarantee that requests will find them.

Clearly, under Heroku's random routing approach you need more dynos to achieve the same throughput you had when they routed requests intelligently. But how many more dynos do you need? If your app needed 10 dynos under the old regime, how many does it need under the new regime? 20? If so, Heroku is overcharging you by a factor of 2, which you might playfully refer to as the Heroku Swindle Factor™.

Intuitively, how much worse do you think random routing is? What's the true value of the HSF™? 2? 5? TEN?!

Below you can see a minute's worth of the simulation. The first animation shows what happens in a world with naive routing. Notice that as time goes on, requests pile up on individual dynos, each dyno represented by a bar that's as high as its current queue of requests.

Now let's turn on intelligent routing, holding the other parameters in the simulation constant. Watch what happens. The bars never grow, because dynos never see more than one request at a time. Requests respond as quickly as Rails can process them:

Here are our final aggregated results:

If Heroku were using intelligent routing, an app with 75 dynos that receives 9,000 requests per minute will never have to queue a request. But with a naive (random) router, that same app — with the same number of dynos, the same rate of incoming requests, the same distribution of response times — will now see a 62% queue rate, with a mean queue time of 2.763 seconds. On average each request will spend almost 6x longer in queue than in the app.

And since each additional dyno adds less and less to your app's concurrency (since it's less and less likely to get used), you have to add a lot of dynos to get the queue rate down. In fact to cut your percentage of queued requests by half, you have to double your allotment of dynos. And even as you do that, the average amount of time that queued requests spend in the queue (column 4) stubbornly holds above 1s.

But of course you can’t actually crank your app to 4,000 dynos. For one thing it’d cost over $300k per month. For another, Postgres can’t handle that many simultaneous connections.

So the only solution is for Heroku to return to routing requests intelligently. They claim that this is hard for them to scale, and that it complicates things for more “modern” concurrent apps like those built with Node.js and Tornado. But Rails is and always has been Heroku’s bread and butter, and Rails isn’t multi-threaded.

What is this?

The Genius annotation is the work of the Genius Editorial project. Our editors and contributors collaborate to create the most interesting and informative explanation of any line of text. It’s also a work in progress, so leave a suggestion if this or any annotation is missing something.