Articles

The interesting bit in this story is that upgrading to 5.7 requires a full table rewrite (ALTER TABLE) for any table that has time-related columns. Their initial test run took months and still hadn’t finished.

AdStage made the move from Heroku to running their service directly on EC2, and in this article they explain why and how.

We were officially only getting about 2 ECUs per dyno, but the reality was that we were getting something closer to 6 since our neighbors on Heroku were not using their full share. This meant that our fleet of AWS instances was 3 times too small, […]

Language Warning: contains the word “sexy” used to describe new or interesting technology.
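
The sizing arithmetic in the AdStage quote can be sketched as follows. This is only an illustration: the dyno count and the per-instance ECU figure are hypothetical numbers chosen for the example, and only the 2 vs. 6 ECUs per dyno come from the quote.

```python
import math


def instances_needed(dynos, ecu_per_dyno, ecu_per_instance):
    """Return how many instances cover the total ECU demand of a dyno fleet."""
    return math.ceil(dynos * ecu_per_dyno / ecu_per_instance)


# Hypothetical fleet: 30 dynos, replaced by instances offering 6 ECUs each.
DYNOS = 30
ECU_PER_INSTANCE = 6

# Plan based on the official figure of about 2 ECUs per dyno.
planned = instances_needed(DYNOS, ecu_per_dyno=2, ecu_per_instance=ECU_PER_INSTANCE)

# What the workload really consumed: closer to 6 ECUs per dyno.
required = instances_needed(DYNOS, ecu_per_dyno=6, ecu_per_instance=ECU_PER_INSTANCE)

shortfall_factor = required / planned
```

Under these assumed numbers the plan comes out at 10 instances against a real need of 30, which is the "3 times too small" gap the quote describes.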

I’ve featured many articles from Mathias Lafeldt as part of his series, Production Ready. Now that he’s moved to Gremlin Inc (a SaaS helping customers run chaos experiments), Mathias reintroduces the history and theory of Chaos Engineering.

The folks behind Mail.ru implemented their own master-master replication system on top of Tarantool, a DBMS I’d never heard of. Their implementation is based on some details of their use-case that may not apply more broadly, but the design discussion is interesting nonetheless.

Facebook rewrote their tool, OnlineSchemaChange, in Python (from the original PHP). OSC is a tool for running DDL in MySQL without downtime.

The original open sourced OSC was more like an engine than a tool. Users needed to write PHP code wrapping to run the schema change, and, with PHP becoming less popular in the operations world, OSC.php wasn’t widely adopted by the community.

A basic introduction to structured logging, including the rationale for using it. With infrastructures growing more and more complicated, I find structured logging indispensable for keeping everything up and running and for debugging difficult problems.
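
For readers who haven’t seen structured logging before, here is a minimal sketch using only Python’s standard library. It is not taken from the linked article: the logger name, the `fields` key, and the example event are all assumptions made for illustration. The idea is simply that each log line is a machine-parseable object rather than free-form text.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is our own convention: a dict passed through `extra=`
        # ends up as an attribute on the record.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def demo():
    """Log one event and parse it back, as a log pipeline would."""
    buffer = io.StringIO()
    log = make_logger(buffer)
    log.info("payment failed", extra={"fields": {"order_id": 1234, "retry": 2}})
    return json.loads(buffer.getvalue())
```

Because every line is valid JSON, a question like "show me all failures for order 1234" becomes a field match instead of a regular expression over prose.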

New in the latest version of the Elastic Stack (think Elasticsearch, Logstash, Kibana, etc.) is built-in anomaly detection using machine learning, based on technology from Prelert (acquired by Elastic in 2016). “Machine Learning”? They might as well say it’s powered by “Lasers™”. If you try this out and have any success, please write up your results and send me a link!

Outages

Telia, a major backbone internet provider, deployed a misconfiguration that caused routing issues across the globe. CloudFlare noticed, as did Pingdom and Discord. Think back to almost a year ago, and you may remember that this isn’t the first time that they’ve caused this kind of far-reaching problem.