
How to Upgrade a Legacy Heroku Database

The number of concurrent users on our service has been growing by leaps and bounds. This is great news for our business but also means our primary database on Heroku keeps running out of cache. In our experience, Heroku’s Postgres databases see a significant performance drop when cache utilization exceeds 85%.

Ten days ago our growth caught up with us and our app began to slow. Engineering traced the slowdown to the database and, specifically, to our cache utilization. We then upgraded to the next plan (Fugu) following the fast database changeover procedure. The process required less than 90 seconds of downtime. The team rejoiced, as we believed we'd found a relatively easy way to keep scaling Postgres, without application changes, for the near future.

Five days ago I checked our cache utilization. We were already at 2.3GB of the 3.75GB cache, and the data size was growing at more than 100MB per day. This meant that in less than 8 days we would have performance issues again. I logged in to Heroku to initiate another fast database changeover, but to my surprise all further upgrades were grayed out (see image below).
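The headroom math works out as follows (the plan cache size, current usage, and 85% threshold are from this post; 100MB/day is the lower bound of the observed growth rate):

```python
# Estimate days until cache utilization crosses the 85% performance threshold.
plan_cache_gb = 3.75        # Fugu plan cache size
used_gb = 2.3               # usage at the time of checking
growth_gb_per_day = 0.1     # >100MB/day observed, so this is a lower bound

threshold_gb = 0.85 * plan_cache_gb            # ~3.19GB
days_left = (threshold_gb - used_gb) / growth_gb_per_day

print(days_left)  # just under 9 days at the minimum growth rate
```

Since the actual growth rate exceeded 100MB/day, the realistic window was under 8 days.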

After contacting support, I learned that SendHub was one of the very first clients to start using Heroku's managed database services back in January 2012. It turns out that all early-adopter Postgres instances run on 32-bit Postgres, and we were already on the largest plan available for that architecture (Fugu). A binary incompatibility between the 32-bit and 64-bit architectures renders their WALs (write-ahead logs) non-interoperable, which means a 32-bit database cannot have 64-bit followers. The end result was that a fast database changeover was impossible. Heroku advised doing a "dump and restore" operation, but that would mean 45 minutes or more of downtime. Even in the middle of the night SendHub has significant activity, making that option unacceptable.

The SendHub stack is homogeneous and straightforward, consisting of Python (Django/Celery), Postgres, and RabbitMQ. All the database interactions use single-statement autocommit (no multi-statement transactions). Since our system never interleaves multi-statement transactions, I thought that maybe the following replay scheme could work:

1. Take the SendHub site offline and make a snapshot of the database.

2. Deploy a branch that sends all database queries to a logging server.

3. Turn the site back on and begin importing the snapshot into the new database.

4. When the import is finished, take the site offline.

5. Replay the collected queries against the new database.

6. Promote the new database.

7. Bring the site back online.
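The scheme above can be sketched end to end in miniature. This is an illustrative stand-in only: it uses Python's built-in sqlite3 in place of Postgres, a list in place of the remote logging server, and a SQL dump in place of a real snapshot, but the sequencing matches the steps.

```python
import sqlite3

# "old" is the live database; "new" will be built from a snapshot plus a replay.
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
old.execute("INSERT INTO messages (body) VALUES ('hello')")

# Step 1: site offline; snapshot the database (here: dump schema and data as SQL).
snapshot = list(old.iterdump())

# Steps 2-3: site back online; every data-modifying statement is also appended
# to a capture log while the snapshot import runs in the background.
capture_log = []
def execute_and_capture(sql, params=()):
    old.execute(sql, params)
    capture_log.append((sql, params))

execute_and_capture("INSERT INTO messages (body) VALUES (?)", ("world",))
execute_and_capture("UPDATE messages SET body = 'hi' WHERE id = 1")

# Steps 4-5: site offline again; import the snapshot, then replay the log in order.
new = sqlite3.connect(":memory:")
new.executescript("\n".join(snapshot))
for sql, params in capture_log:
    new.execute(sql, params)

# Steps 6-7: the new database now matches the old one and can be promoted.
assert (old.execute("SELECT body FROM messages ORDER BY id").fetchall()
        == new.execute("SELECT body FROM messages ORDER BY id").fetchall())
```

The single-statement autocommit property is what makes the in-order replay safe: there are no interleaved multi-statement transactions to untangle.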

When I first described the above approach to one of my co-workers, his first response was, “Jay, you’re crazy”; however, given that the only alternative was a massive amount of downtime, I decided to write a remote logger and test out my hypothesis.

I wrote a simple query log capturing and replay system in PHP. Then I modified the SendHub Django application to forward all of its data-modification SQL statements to the capturing system. I had everyone at the office help test on staging, and the results were promising: the replay system cleanly reassembled the data.
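The capture side can be sketched as a cursor wrapper (this is not SendHub's actual PHP system, just a minimal Python illustration of the idea): every data-modifying statement is forwarded to a logger before execution, while reads pass through untouched. The `send_to_log_server` callable is a hypothetical hook; in production it would ship the statement to the remote logging server.

```python
import sqlite3

DML_PREFIXES = ("INSERT", "UPDATE", "DELETE")

class CapturingCursor:
    """Wraps a DB-API cursor and forwards data-modifying SQL to a logger."""

    def __init__(self, cursor, send_to_log_server):
        self._cursor = cursor
        self._send = send_to_log_server

    def execute(self, sql, params=()):
        if sql.lstrip().upper().startswith(DML_PREFIXES):
            self._send(sql, params)   # forward DML only; SELECTs are not captured
        return self._cursor.execute(sql, params)

    def __getattr__(self, name):      # delegate fetchall(), fetchone(), etc.
        return getattr(self._cursor, name)

# Usage, with an in-memory list standing in for the remote log server:
log = []
conn = sqlite3.connect(":memory:")
cur = CapturingCursor(conn.cursor(), lambda sql, p: log.append((sql, p)))
cur.execute("CREATE TABLE t (x INTEGER)")
cur.execute("INSERT INTO t (x) VALUES (?)", (1,))
cur.execute("SELECT x FROM t")
assert log == [("INSERT INTO t (x) VALUES (?)", (1,))]
```

Capturing only DML keeps the log small and is sufficient here, since replaying reads would be pointless.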

The end result was that Ryan (our engineering lead) and I teamed up to execute the changeover procedure on production with a total of only 6 minutes of downtime instead of 45. There was one minor hiccup: we learned that Django disables query logging when "debug" mode is turned off, which set us back a minute or two. All in all, it was an incredible success.

Below you can find our process and code for performing the upgrade.

At SendHub, we streamline our production site maintenance by preparing a document with detailed instructions including all the commands which will be run. Here is the planning document we used for this procedure: