Happily upgrading Ruby on Rails at production scale

The Envato marketplace sites recently upgraded from Rails 2.3 to Rails 3.2. We
did this incrementally at full production scale, while handling 8000 requests
per minute, with no outages or problems. The techniques we’ve developed
even let us seamlessly and safely experiment with mixing Rails 4 servers into
our production stack in the near future.

We wanted to be able to confidently make the huge version jump without having
to do an all-or-nothing cutover. In order to achieve this we made a number of
modifications that allowed us to run Rails 3.2 servers side-by-side on our load
balancer with all of our 2.3 servers. This let us build confidence in our
upgrade to the new version gradually with far lower risk of our users receiving
a bad experience.

If you are still stuck on a Rails 2.3 app, this should help kick-start your
upgrade to Rails 3 (and beyond to 4 if you’re ready).

This post will go into the technical details around making this upgrade as
smooth as it was.

History

The Envato Marketplaces run on quite an old code base, started back in
February 2006 on Rails 1.0. It has seen its fair share of hairy Rails upgrades
over the years, but the changes in the framework between Rails 2.3 and 3.0 were
pretty much a rewrite. For a long time it felt like we’d accrued too much
technical debt from previous upgrades to make the jump without giving the code
base a lot of love first.

We were stuck on 2.3 for quite a while, and we let 3.0, 3.1 and a large number
of 3.2 releases slip by. Eventually the pain of being on such an old,
unsupported version of Rails became enough that we were able to get approval
to pay back some of our technical debt and do some long-overdue upgrades.

One of the things we did to build up our own confidence, and the confidence of
the business, in our upgrade to Rails 3.2 was to create some patches both to
Rails 2.3 and to Rails 3.2 that would allow us to have requests bounce between
servers of both versions seamlessly. This let us run a single Rails 3 server
amongst the many other Rails 2 servers for short bursts, so that even if the
Rails 3 server failed catastrophically, it would still only cause a small
percentage of user-visible failures, and would allow us to test for performance
and reliability without having to do an all-in switch at the first smoke test.

Problem 1 - Changes to SessionHash

The SessionHash is where details are stored that let us identify who a user is
logged in as, so we can determine what parts of the site they can access, show
them their own profile settings, etc.

In Rails 2, the SessionHash
(source)
was much closer to a plain old hash, meaning if you store something in
session[:foo] and try to access it via session["foo"], you’d get nil. On
Rails 3, the SessionHash (provided by Rack,
source)
started acting like a HashWithIndifferentAccess so both session["foo"] and
session[:foo] would return the same thing.
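The Rails 2 side of the difference can be reproduced with a plain Ruby Hash, which is effectively how the old SessionHash behaved (a minimal sketch, no Rails involved):

```ruby
# A plain Hash keeps symbol and string keys as two separate entries,
# which is what made Rails 2.3 session lookups so brittle.
session = {}
session[:foo] = "bar"

session["foo"]  # => nil  (the string key was never set)
session[:foo]   # => "bar"
```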

This meant that taking a session from Rails 2 to 3 worked correctly: the newer
version of Rails didn’t care whether we had stored the session_id as a string
or a symbol, it could happily find it and we’d end up with the same session_id
we had on Rails 2.

The problem was that once a session goes through Rails 3, all of the keys to
the SessionHash are then stored as strings, meaning if we take the session back
to Rails 2 and it looks for session[:session_id], we get nil and a new
session_id is generated, ignoring the old session_id and logging out the user.

This is obviously not desirable as it means any user who hits our Rails 3 test
server is very likely going to be logged out on their next request. This is
because our load balancer is not configured to do “sticky sessions” where a
user’s requests will always hit the same server, and consequently the next
request will probably hit a Rails 2 server.

To deal with this, we wrote a patch for ActionController::Session on Rails
2.3
which approximated the behaviour of Rails 3 closely enough that requests could
bounce back and forth between versions and the user remained logged in with
the same session_id.

The basic idea is that we store everything in the SessionHash as strings, and
then when looking things up we seek using a string key first, then fall back to
seeking with a symbol key if it’s not found (i.e. if it’s a SessionHash from a
vanilla Rails 2.3 server).
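As a rough sketch of that idea (the class and method names here are ours for illustration, not the actual patch), the write and lookup logic looks something like this:

```ruby
# Hypothetical sketch of the compatibility behaviour: writes are
# normalised to string keys (matching Rack's SessionHash in Rails 3),
# and reads try the string form first, then fall back to the symbol
# form so data written by a vanilla Rails 2.3 server is still found.
class CompatSessionHash < Hash
  def []=(key, value)
    super(key.to_s, value)
  end

  def [](key)
    fetch(key.to_s) { fetch(key.to_sym, nil) }
  end
end
```

With this in place, `session[:session_id]` and `session["session_id"]` resolve to the same value regardless of which Rails version wrote it.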

Problem 2 - Changes to How Flash Messages are Stored

In Rails, there is the concept of a flash, a short-term storage place for
messages, errors and the like that will be displayed to a user on their next
page view. For example, when you edit your profile and it saves successfully,
a message is passed to the next page via the flash, which is then displayed to
say everything went according to plan.

The class used to do this is marshalled into a binary format in the session,
and then unmarshalled on the next request. The problem that occurred here was
that the class used for flashes completely changed between Rails 2 and 3,
meaning that if you attempted to unmarshall a Rails 2 session with a flash
object stored in it on a Rails 3 server (or vice versa), the request would blow
up with an ActionDispatch::Session::SessionRestoreError, complaining that an
object of a class was found but the class isn’t defined anywhere.

In Rails 4 pre-release builds, this practice has stopped and the flash is now
stored in the session as a basic Ruby hash, meaning it’ll happily unmarshall
on any version of Rails, even if the version of Rails doesn’t know how to get
the flash message out from that data structure.
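The safety property here is easy to demonstrate in plain Ruby (a minimal sketch, not the actual Rails serialization code): a plain Hash survives a Marshal round trip with no class definitions required.

```ruby
# Marshalling a plain Hash depends only on core Ruby, so any server in
# the pool can load it back, whether or not it knows about flashes.
flash_data = { "notice" => "Your profile was updated." }

serialized = Marshal.dump(flash_data)  # what goes into the session
restored   = Marshal.load(serialized)  # loads cleanly on any version

restored  # => { "notice" => "Your profile was updated." }
```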

We think this is a much better approach, and thus we back-ported this new
method of flash serialization back to both Rails
2.3
and Rails
3.2.

We also got some hacks working that let us unmarshall the missing flash class
on each Rails version even without a full class definition, meaning we could
use our knowledge of how things were stored internally in those objects and
#instance_variable_get to pull out the messages and bring them into the new
format.
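The general shape of that hack looks something like this (a sketch under names of our own choosing; the real constant paths differed per Rails version, and `@flashes` is the Rails 3-era internal we are assuming here):

```ruby
# Stand-in definition for a flash class that isn't otherwise loaded, so
# Marshal.load no longer raises "undefined class/module". Once an object
# is loaded, #instance_variable_get pulls the stored messages out into a
# plain Hash that the current Rails version can use.
module LegacyFlashCompat
  class FlashHash
    def to_plain_hash
      instance_variable_get(:@flashes) || {}
    end
  end
end
```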

With these patches in place on both our Rails 2 servers and our Rails 3 test
server, it was possible for a user to bounce between servers of different
versions without being logged out, and without seeing error pages because the
current server couldn’t understand the flash message from the previous server.
It would theoretically even be possible to add a Rails 4 box to the pool and
have the session be happily understood on any server in the pool.

Problem 3 - Consistent URLs Between Versions

The final hurdle in running Rails 2 and 3 concurrently in production was making
sure all our URLs remained the same between Rails 2 and 3. The format for the
routes file completely changed between these versions and we took this as an
opportunity to kill off a lot of overly-permissive routes that let through too
many HTTP verbs, many dating back to a time before Rails even spoke REST.
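To illustrate the shape of that tightening (the route itself is made up for the example): a Rails 2.3 `map.connect` route matched every HTTP verb, while the Rails 3 DSL lets you pin a URL to exactly the verb it should accept.

```ruby
# Rails 2.3 routes.rb: this matched GET, POST, PUT and DELETE alike.
#   map.connect 'items/:id', :controller => 'items', :action => 'show'

# Rails 3.x routes.rb: same URL, but now only GET is routed.
Rails.application.routes.draw do
  get 'items/:id', :to => 'items#show'
end
```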

The initial work on this involved lots of manual testing to ensure URLs and form
actions were still matching up. Once we had fairly high confidence that most
things aligned, we developed some time charts in our log aggregation tool
Splunk which allowed us to see when a request came
through and was unable to be routed to a particular controller. We always see
some level of background noise, as we get lots of requests for random PHP
files and the like. Some of these requests are obviously malicious, some just
innocently bad URLs, but by graphing them on a time chart we were able to
distinguish normal background noise from new routing problems caused by Rails 3.

Rollout

With these patches in place and the Splunk charts at our disposal, we were in a
very safe position to silently start serving requests on Rails 3 in short
bursts, and get an even better level of confidence that all our URLs matched up
correctly. Initially we added one Rails 3 server to the load balancer for a 1
minute smoke test, which revealed very few problems, just a few unusual routes
we’d missed surrounding the API. These were fixed and we did progressively
longer tests, each time fixing any problems that were revealed, and eventually
we could have a Rails 3 server in rotation for 30+ minutes with no obvious
change in error rate.

Conclusion

The extra work involved in getting things into a position where we could run a
Rails 2 and Rails 3 version of our app concurrently was definitely worthwhile.
It allowed us to detect problems we otherwise could not have, without the site
appearing broken to all users. This gave us a huge degree of confidence that
the final cutover would go smoothly and it allowed us to properly assess the
performance of our Rails 3 app with production levels of traffic without having
to “bet the farm” so to speak.

I ended up being the on-call person for the first night after the full
cut-over, but because of all the work we’d put in beforehand to make sure
things went smoothly, the night was so quiet that I started to worry if my
phone was actually working.

You would expect with an upgrade this big that even when you think of
everything there’ll still be something that slips through, but the things that
did were decidedly minor, and I got an uninterrupted night’s sleep after what
was probably our most high-risk upgrade to date.