How Swat.io migrated from MySQL to PostgreSQL in 2 years

Nothing can beat a good clickbait title. But this one bears truth in it. In fact I would argue that mentally the migration already started even before that as things got more and more complicated with MySQL.

If you’re a small team, back then we were only three, there are only so much resources and expertise you can invest in your infrastructure.

MySQL has many benefits and we certainly wouldn’t be where we are, without it. But over time its shortcomings factored in until they outweigh the benefits. Due to lack of deeper expertise we always relied on either Percona MySQL or the native version shipped with Ubuntu LTS 14.04 and never worked with external tools because, well, someone has to learn/understand/maintain these workflows.

The top reasons we were not satisfied with MySQL 5.6

Cannot online add a new column
A real problem for big tables as this forced us to have downtimes every other month. And having the lack of knowledge/resources for tools like pt-online-schema-change, this left us in despair.

Cannot, in a reliable way, add online indices
It is possible, but there are a few rules to follow and in practice it was always unnerving because we also felt we didn’t know what awaited us in production

(back then) No support for JSON column types (introduced in 5.7.8)

Lack of advanced features like CTE (Common Table Expressions) and Window Functions

We had complete lock ups where max_connections was exceeded and we could never find a source, internal or external, to our system. Literally hundreds of connections doing SELECT statements but nothing else. Eventually manually killing them “solved” it

There were other small or benign issues. Each on their isn’t a problem, but they all factored in to our dissatisfaction:

No native boolean type
Everyone accepts that tinyint(0) is the equivalent and has to perform native casting in all adapter languages or at least configure their ORMs/DataModels

Creating a full dump with mysqldump took multiple hours (3-4)
And it isn’t easily (read: built-in) possible to parallelize it

Importing that dump took multiple more hours than creating it (8-9)

DDL statements are not part of transactions

It was somewhen around 2014 when our former hosting partner 25th floor suggested to use Postgres, but despite the good words this wouldn’t automagically increase our teams’ resources to switch the whole application over.

Start small? Win!

Nevertheless we followed the advice and took a closer look at Postgres. In our simplemindedness we thought we could get away with a simple driver change. Boy were we wrong. The first PR with the vision to completely switch to Postgres was created in November 2014 and was titled “[POSTGRES] Brave new world”. However, it had too many problems, got abandoned soon and eventually was thrown away half a year later. Famous CTO response on the final comment: “kick it like beckham!”

But we didn’t give up. After some discussion and planning, we decided that we will base the new Swat.io Analytics Engine on Postgres. This turned out to be the perfect opportunity: everything would be done from scratch, so no ties to an existing system. The only downside: our primary data would still be in MySQL. To get it into Postgres, we needed synchronization jobs. Not ideal, but acceptable from a business perspective and thus the way forward for us.

The project itself took quite some months to complete, which, besides having to develop the new backend and UI code, also gave us ample time to learn the new database system:

How to we integrate it into our developers VM stack?

How does the role concept work?

How does it behave under write-intensive load?

How quickly can we prototype new features?

How do we tune it, if we are not satisfied with the query performance?

How does backup/restore work?

How does it integrate with our existing eco-system with PHP and NodeJS?

Mirror models? Fail!

The Analytics project was a success and we were so fond of Postgres but still remembered the failed attempt to migrate the whole application in one go that we tried another approach:
Create mirror models within our ORM which were connected to Postgres. On every save of the MySQL model we would trigger the same save to Postgres. While in theory this sounded nice, we always had gaps or certain fields not being properly updated that we had to abandon this approach after a while. Also having to apply such a workflow change across multiple repositories with different frameworks didn’t help exactly, either.

Continue small steps? Win!

When we hit our Summer camp 2015 we decided to move some logging tables to Postgres. The information herein is volatile by nature and recycled after some TTL anyway. This allowed us to forgo any migration of data and simply switch to the new database and let the old one discontinue and eventually remove it.

The next attempt

Around August 2015 our team hit some more frustration points again with MySQL and another attempt to migrate the application code base was made. This attempt was far more advanced than the first one and one could really “use” almost the complete application.

There was just one “minor” inconvenience: the performance was WAY behind MySQL! This was even easily measurable in our modest developers VM. We did micro-optimizations (e.g. disable SSL connections) and started to overhaul queries. But the main problem as we concluded: our application does (sometimes) hundreds and hundreds of small SQL statements during a page load and each one of them was just a tiny bit slower in Postgres, but it added so much up that the difference wasn’t acceptable.

Mind you that this conclusion was a major factor how any future optimization was done:

Do more on the database side (i.e. make smarter compound queries, etc.)

Embrace the database even more (just because it can be done client side, doesn’t mean it has to be).

Embarrassingly it wasn’t until almost 1 year later we found the culprit (CakePHP2 Postgres driver overhead), but nevertheless it led to so many optimization, a deeper understanding of what we need to and can do, that it ultimately lead do improvement which was beyond what we ever thought we were capable doing.

Unfortunately, this attempt stalled after a few weeks/months and was almost forgotten.

Challenge everything. Don’t give up.

Still, over time every new internal project pretty soon raised the question:

Do we want to create new tables in MySQL or Postgres?

This wasn’t always easy as we were embracing InnoDB features of foreign keys to keep our data consistent and having parts of it in another database is error-prone. But we absolutely were intrigued by having our main “pain” points with MySQL being basically non-existent:

As long as the new column is NULL and does not have a default value, it’s in practice a no-op to add it. No matter if your table size is 100MB or 100GB

Adding a new index never became an issue for the system, simply add them CONCURRENTLY and you’re done

As we didn’t just expand our data stores but also our code repositories, writing tests reached it’s then-peak around 2016 when we introduced a new backend: our JSONAPI based API. We had great experience with wrapping any kind of DML and DDL statements in transaction for our tests but ultimately had to abandon this approach as we still had our primary data in MySQL which didn’t support this.

Dawn of the Migration

The trend to Postgres however continued. Issues with MySQL haunted us more and more, maintenance windows during the night became more and more problematic as adding a few columns to our biggest tables started to take 2+ hours or sometimes was completely unpredictable and exceeded our announced downtime windows.

After a very productive spring and summer 2016 we made another attempt to migrate the whole application to Postgres. Based on the former year-old PR a new attempt was made. Eventually prioritization with the management became fruitful and the project was greenlit.

Having learned all the things in our attempt in 2016 led to an enrichment of performance and unleashing hidden capabilities of developers bug finding, which was unheard of before in our team. After many months of hard work and much much testing by the team, eventually in February 2017 we were, or so we thought, ready to make the move.

Whilst the migration of the data itself was successful, the final result had performance problems in a critical part of our applications. Unfortunately, we had to roll back, but we were eager determined to find the root causes and don’t give up.

Throughout our road to Postgres we received excellent support from Markus Winand of use-the-index-luke and modern-sql fame whose overall in-depth knowledge of multiple databases continues to amaze us every time. With his help and dedication from our team, we overcame the problems which in the end turned out to be a negative perception of performance due to the database being “cold”. As a final step to our migration we added pg_prewarm on selected relations and so enhanced the first-time experience for our most user-facing critical parts (an “issue” which basically vanished after a few hours of operation).

Our second attempt in March 2017 was finally a success. Although the data migration took longer than expected and during the final stages uncertainty spread while we monitored the first steps when we let the traffic hit Postgres, in the end, everything turned out to work awesome.

Our final migration in numbers

120GB of data migrated with pgloader, using custom scripts to manually parallelize it within tmux to use every core available (we couldn’t get a single pgloader instance reliably to work, so we used a dedicated one for each table)

The migration itself took around 4 hours:

Migrated the data itself

Added new indices we needed in Postgres to use its full potential

Created new foreign keys which previously were not possible

and ran VACUUM ANALYZE on all tables
In the end we ran up to 32 SQL scripts in parallel to speed things up (on could argue that this isn’t ideal but it certainly made our workflow easier)

To ensure the migration, with its high degree of concurrency, worked properly, it was probably performed about 20-30 times during the preparation, spanning a time frame of roughly two months.

We had to switch 8 code repositories (three of them with big code bases)

Over 250 comments were written across all pull requests to point out problems, track the state or discuss other issues

Over 150 commits were created and countless hours were spent to carefully review all code changes

We don’t have the number of GIT rebases we performed against master but taking a guess we would suggest around 50

During the time of the first failed and the second successful migration, we only found two real bugs in the code bases. We assume we still have bugs in certain edge cases but this experience gave us good confidence that we were on the right track.

Observation and lessons learned

After importing a batch of data, VACUUM ANALYZE is your friend to keep the query planner happy

datetime with time zone (which is what you want to use) obviously returns a different “string” format which needs to be ensured to not cause troubles when using improper date abstraction to work on them (read: directly modifying strings, probably not so smart)

Some predominate code base is still based on CakePHP2 and had its own set of very specific problems:

The Postgres driver did make an extra call to the database to fetch column metadata for each result received. This lead to increased latency due the many, even though small, overhead queries we had. Unfortunately, we only found this culprit very late in the development cycle but hey, eventually we did! Here’s how we workarounded it. Note that enabling in-application SQL logging didn’t reveal this overhead. We had to go deeper and activate log_statement.

We encountered problems with escaping certain data which we only found out a few days after the migration. We already had a special purpose Postgres driver and adapted it to better escape the data

The 63 character identifier limit posed a problem in certain places where we had used CakePHPs ORM to query on models which already had a very long column names. Due to the ORMs internal aliasing, fields would be truncated and thus overwrote each other upon returning to the application. That part had to be rewritten in pure SQL and the result transformed back into a CakePHP compatible structure.

Switching our Laravel repository was pure joy:

It allowed us to get rid of a lot of manual boolean casts due it’s native type support.
The Laravel repo was, compared to the other big repos, probably the only one which had more removed than added lines of code for the transition.

With a single exception, no framework code change had to be performed to get things working. All that was changed was configuration, database migrations and tests (due native types support mostly).

We were able to optimize our “threaded comments”-fetching code from “n * m” queries down to a single (albeit very big and complex) one, yielding great performance even on threads with hundred thousand of comments (using CTE and Window Functions)

Although overwhelming at first, the EXPLAIN (ANALYZE, BUFFERS) output is so much more insightful and useful compared to what you get from MySQL, it’s beyond comparison

The alphabetical ordering of results for MySQL vs. Postgres is just different. Deal with it. MySQL cooks its own thing, Postgres depends on the native OS features. (Hint: This is changing in PostgreSQL 10)

In MySQL, to make use of UTC (which we are using without exception throughout the whole system) you first have to import the time zones; in Postgres, it’s available out of the box

Triggers are of course also supported in MySQL, but for us, they turned out to be more useful in Postgres with its WHEN capability, being able to simplify the trigger function itself and decouple it from the condition when to apply it

By taking advantage of new things possible (partial indices, window functions) we were able to cut our average system load in half (went down from 8 to 4). We are aware that this is like comparing Apples and Oranges, because so many parts even within the application had to be adapted; but merely measuring this change was very interesting.

WAL archiving can be surprising. We had cases where seemingly innocent looking update statements suddenly produce 100GB of WAL files. You better want a filesystem you can expand online for that :-)

Refactoring is everything to our team; especially naming. Nothing beats consistency and lowering the mental overhead of how things are expected to be named, thus how to locate them quickly. As such, it was very welcomed that we finally could rename all the hundreds of indices and foreign keys to make sense – at any point in time without a downtime.

You cannot tell Postgres where you want to a particular column to be placed; this is sorely missed from MySQL where we grouped logical columns together.

As with everything, tuning a complex system is an art of its own. However, in comparison, we found the documentation of Postgres to be much enjoyable to read and not dig tossing to much from its focus. In our experience, the complex concepts and details of how Postgres works to take advantage of its full potential are much better explained then details of MySQL

Closing Words

The mental overhead of the two databases (because ultimately both served the same domain) and the maintenance headaches with MySQL started to have negative effects on our teams morale and Postgres finally put at end to this. It’s hard to find words describing the kind of satisfaction our team felt when we finally were able to push the button.

With all the baggage left behind we’re looking with great prospect into 2017 to improve performance even more, handling our scaling needs better and increase our customers satisfaction.

Comments

I’ve been working for the past 3 years on a project that has used Postgres from the start and there are two constant and very annoying problems you’re about to run into. One of them will (if promises hold) be fixed in PG 10, but the other will remain forever.

1. You cannot perform a live upgrade of your database. If you have replication, you can’t simply upgrade one server at a time and reconnect it to synch-up. You are forced into having down-time for every non-bugfix update.

2. You grow dependant on it’s extras that are unavailable in any other SQL database. For example, Array types are very convenient, but lock you into never being able to even consider changing to another database. Just looking at http://www.sql-workbench.net/dbms_comparison.html the only database that comes close is Oracle and it’s still far away and you get “not yet implemented” errors for things that are described in their documentation as if they exist.

For us, 1) is acceptable: we upgrade less often to a major version then we add columns; which we do pretty often (“no product is ever finished” 😁).

Your thoughts about 2) are interesting and we’ll keep that in mind. Such a databse migration is more like a one-time thing and if performance serves us, we will use database specific extensions (we do, actually). To me that’s just how reality is. We constantly refactor code anyway, so a lock-in as you describe it would just be “present” code thing and may be changed already tomorrow.

Thanks!

on 2017/04/25

By TomasV

Posting a comment here is rather broken, because of wrong SSL cert (apperently the cert is for *.walls.io, walls.io). I’m not going to retype the whole response again :-/

Good read and good to hear we’re not the only ones to have gone through this pain. Your story reads almost identically to ours 3 years ago!

From your lessons learned, on using VACUUM ANALYSE, after importing a lot of data (and assuming you haven’t updated any) you actually only need to use ANALYSE to update / refresh the query planner’s statistics. VACUUM is only required if a lot of data has *changed* as it clears up dead tuples – old versions of records kept for MVCC.

However, we actually did change *some* data during the migration. I already forgot the details and I’m to lazy to review the already-removed migration code, so I think in the end doing `VACUUM` was still the right call.

What be interesting to hear what tools you did use to perform the actual migration (e.g. we used pgloader/tmux/bash/custom SQL scripts with psql, etc.).

To be honest, it doesn’t hurt to do a VACUUM if you can afford the downtime (it locks the tables whereas ANALYSE can be run while the system is in normal use – caveat: ANALYSE may still hammer the storage).

We were migrating from another, much older, database tech and we had to perform cleansing and type marshalling on some of the data as part of the migration so we wrote a bespoke migration tool piping the data through libpq and using its binary format for performance.

@russellk vacuum do not locks tables, it is only vaccum full which locks tables. The main difference between vacuum analyze and analyze is vacuum must read all the table. Analyze read only a part of the table to have a statistic sample.

> “our application does (sometimes) hundreds and hundreds of small SQL statements during a page load and each one of them was just a tiny bit slower in Postgres, but it added so much up that the difference wasn’t acceptable”

Did MySQL query cache was enabled? It could explain that. But when workload grows it is recommended to disable query cache (its management can introduce large overhead)

> Did MySQL query cache was enabled? It could explain that. But when workload grows it is recommended to disable query cache (its management can introduce large overhead)

Yes, absolutely. Without it, I don’t think we would have had a good experience before switching to PostgreSQL. This caching unfortunately facilitates lazy developing and during our path to migration, we overhauled many of these problems and tried to minimize the number of queries which in some cases we were able to do drastically. Which of course also benefited MySQL; not everything is MySQLs fault here. When possible, we applied optimization already against MySQL instead of stacking it on top of the pile for the actual migration.

However, mind you this (which is explained later in the article too):
> > “our application does (sometimes) hundreds and hundreds of small SQL statements during a page load and each one of them was just a tiny bit slower in Postgres, but it added so much up that the difference wasn’t acceptable”

The real culprit here was the CakePHP2 Postgres drivers overhead. Without it, the difference for the common case was hardly measurable. ¯\_(ツ)_/¯

Apologies, you are correct.
I find in general that analyse has a much less negative and, especially when it’s freshly populated, much more positive impact on a live system compared to vacuum in any case.