Migration to Elasticsearch through the eyes of QA

In a recent blog post, Nick explained how and why we decided to double down on Elasticsearch. It took a combined effort of our developers and systems engineers to make it happen.

With millions of emails sent by Postmark every day, the switch also presented a challenge for the QA team. In this post I’d like to share the process we used to ensure a high-quality, minimal-risk deployment of this major infrastructure change.

Every database migration is hard, since it can touch so many parts of the application. A migration like this one required testing every part of the application - we needed to be sure that none of the features were broken. One thing that made the migration from CouchDB to Elasticsearch much easier was automated testing.

For automated testing we use tests written in Ruby, and we run them through Jenkins. The UI is checked using Selenium. We run most of the tests every day, on both staging and production. This allows us to test our application all the time, find new edge cases, tune the tests if needed, and gather a lot of testing data at the same time.
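
To give a flavor of what such a check looks like, here is a minimal sketch in Ruby. The client call and the response below are stand-ins for illustration, not our actual test code:

```ruby
require "json"

# Stand-in for a real API call over HTTP (assumption for this sketch).
FAKE_RESPONSE = '{"ErrorCode":0,"Message":"OK","MessageID":"test-0001"}'

def send_email(from:, to:, subject:)
  # In the real suite this would hit the Postmark API and parse its JSON reply.
  JSON.parse(FAKE_RESPONSE)
end

def assert_send_works
  resp = send_email(from: "qa@example.com", to: "qa@example.com", subject: "smoke")
  raise "unexpected error: #{resp['Message']}" unless resp["ErrorCode"].zero?
  raise "missing MessageID" if resp["MessageID"].to_s.empty?
  "PASS #{resp['MessageID']}"
end

puts assert_send_works
```

Checks like this one run on a schedule in Jenkins, so a regression shows up as a red build rather than a customer report.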

By the time we started testing the migration from CouchDB, we already had lots of existing data gathered by the automated tests, which made regression and inconsistency testing much easier.
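
Conceptually, an inconsistency check boils down to diffing the documents we expect (from data gathered before the migration) against the documents the new store returns. A hedged sketch, with invented sample data:

```ruby
# Compare a pre-migration snapshot against post-migration query results.
# Documents are keyed by id; the sample data below is invented.
def inconsistencies(before, after)
  missing = before.keys - after.keys
  changed = before.keys.select { |id| after.key?(id) && before[id] != after[id] }
  { missing: missing, changed: changed }
end

before = { "m1" => { subject: "Welcome" }, "m2" => { subject: "Reset" } }
after  = { "m1" => { subject: "Welcome" } }

result = inconsistencies(before, after)
puts "missing: #{result[:missing].inspect}, changed: #{result[:changed].inspect}"
```

Any non-empty `missing` or `changed` list is a red flag worth investigating before sign-off.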

We ran the tests frequently on staging, after each deploy, until all of them were green. Once all tests were green, we did manual testing to verify everything was in place and nothing had been missed by the automated tests. During testing on staging, we also updated the automated testing suite with new tests to better cover the Elasticsearch changes.

Once we were sure that everything worked as it should, we needed to be sure that sending speed was not compromised in any way. We ran a series of stress and performance tests to confirm that the new setup was performant under both normal and heavy loads.

For email send testing, we mostly used Gatling until we were satisfied with the results on staging. We compared the data we were getting on staging with the data we had gathered prior to the migration.
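
The Gatling simulations themselves are Scala and out of scope here, but the comparison step amounts to something like this Ruby sketch - the latency samples below are made up for illustration:

```ruby
# Compare latency percentiles from two load-test runs (sample data is invented).
def percentile(samples, p)
  sorted = samples.sort
  sorted[((p / 100.0) * (sorted.size - 1)).round]
end

before = [110, 120, 125, 130, 200, 210, 250, 400] # ms, pre-migration baseline
after  = [90, 100, 105, 110, 150, 160, 190, 320]  # ms, Elasticsearch on staging

[50, 95].each do |p|
  puts "p#{p}: #{percentile(before, p)}ms -> #{percentile(after, p)}ms"
end
```

Looking at tail percentiles (p95 and up), not just averages, is what catches the "fine on average, slow under load" class of regressions.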

For testing our API, we used our Ruby tests, which track the performance of each API endpoint and report it to Librato. In Librato we analysed performance before and after the updates.
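
The tracking itself is simple: time each call, then ship the number to the metrics service. A minimal sketch, where `report_metric` is a hypothetical stand-in for the real Librato client:

```ruby
require "benchmark"

# Hypothetical reporter standing in for the Librato client (assumption).
def report_metric(name, value_ms)
  puts format("%s=%.1fms", name, value_ms)
end

# Time a block and report its duration under the given metric name.
def timed_call(name)
  elapsed_ms = Benchmark.realtime { yield } * 1000
  report_metric(name, elapsed_ms)
  elapsed_ms
end

timed_call("api.messages.search") { sleep 0.01 } # simulated API call
```

Because every endpoint reports under its own metric name, before/after comparisons in Librato are just a matter of overlaying two time ranges.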

There was a problem, though: as with every performance or stress test you run on staging, you cannot be 100% sure that what you see on staging will behave the same way in production.

The staging and production environments are rarely identical, so we always take the results we get on staging with a grain of salt. We always performance test on production once we release.

Once we were satisfied with performance on staging and had finished automated testing, we did a manual test of the system prior to release. When the QA team approved the staging updates, we were ready to release them to production.

The release was carefully planned out, and we made sure to deploy during off-peak hours - when there is the least traffic - so that we would have the option to revert the changes if we saw any problems. Right after the release, we ran the full automated testing suite, ran the performance tests, and checked the data for consistency.

The first release looked perfect, until we spotted an issue during performance testing. In an isolated scenario, we noticed that in rare cases, under heavy load, documents could get lost.
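
Detecting this kind of loss comes down to writing a known number of documents under concurrency and then counting what actually arrived. A self-contained sketch of the idea, using a thread-safe queue as a stand-in for the search index:

```ruby
# Write N documents from several threads, then verify nothing was dropped.
# The Queue here is a stand-in for the real index (assumption for this sketch).
def concurrent_write_check(workers: 8, docs_per_worker: 100)
  store = Queue.new # thread-safe collector
  workers.times.map { |w|
    Thread.new { docs_per_worker.times { |i| store << "doc-#{w}-#{i}" } }
  }.each(&:join)
  [store.size, workers * docs_per_worker]
end

written, expected = concurrent_write_check
if written == expected
  puts "no documents lost (#{written}/#{expected})"
else
  puts "lost #{expected - written} documents"
end
```

Against a real store the check is the same shape, just with indexing calls and a refreshed count query instead of a queue - and a single lost document fails the run.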

Although rollbacks are never simple, we were (and are) unwilling to accept data loss, so we decided to rollback our first deployment of Elasticsearch when we detected this edge case.

Once we were sure we had fixed the issues, we performed the deployment again. Everything went smoothly, and all our tests passed this time.

Without our testing tools, we would never have found the edge case. It would have sat there in the production environment, showing its face only after many emails had already been lost. We avoided this altogether.

With contingency plans, automated testing, and careful execution we were able to seamlessly switch to Elasticsearch, while detecting and handling minor setbacks. The success of this deployment is largely due to the meticulous planning of Milan, Nick, Chris and Michael.

Although it took longer than expected and we needed to revert the changes at first, it was well worth it. We made the right move - no emails were lost, and Postmark is sending emails faster than ever.