The outage last week was a sequence of events caused by a recent internal change. We switched our internal DNS server to slave off Opera’s servers to allow better internal DNS integration. Unfortunately, we were only partway through that process and had set up only one internal server. It’s our general policy that everything we set up these days must be replicated between at least two servers. We had intended to do that here, but hadn’t got around to it.

That internal DNS server was running on the same machine as our primary database server. Unfortunately, that machine crashed with a kernel panic. Normally we’d just fail everything over to our replica database server, but because the internal DNS server was also down, all our tools that expected to resolve internal domain names failed too, and we weren’t able to fail over easily. And because internal DNS was down, we couldn’t easily reach the server’s remote management module (RMM) to reboot it, and had to go through the NYI ticket system, which always takes a bit longer.

The net result is that something we should have detected within a few minutes, and easily handled with our failover tools, took almost an hour to resolve.

We’ve now set up the internal DNS servers as part of our standard redundant setup. We’ve also set up consistent naming and IP addresses for all our RMM modules so that they’ll be easier to access, and even if there are DNS problems, we’ll be able to reach them via IP.
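To illustrate the idea of reaching a management interface by IP when DNS is unavailable, here is a minimal sketch in Python. It is not our actual tooling: the host names, the addresses, and the `resolve_with_fallback` helper are all hypothetical, and a real setup would more likely keep the address table in configuration than in code.

```python
import socket

# Hypothetical static table of RMM names to fixed IPs.
# These names and addresses are illustrative only.
STATIC_RMM_ADDRESSES = {
    "rmm-db1.internal.example": "192.0.2.11",
    "rmm-db2.internal.example": "192.0.2.12",
}


def resolve_with_fallback(hostname,
                          static_table=STATIC_RMM_ADDRESSES,
                          resolver=socket.gethostbyname):
    """Return an IP for hostname, trying DNS first, then the static table."""
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror (a failed lookup) is a subclass of OSError.
        if hostname in static_table:
            return static_table[hostname]
        raise LookupError(
            f"cannot resolve {hostname!r} via DNS or the static table"
        ) from None
```

The `resolver` parameter exists only so the DNS step can be swapped out when testing. The point of the design is that a tool built this way keeps working for known hosts during a DNS outage, instead of failing along with the DNS server.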

We can’t stop servers crashing, but we aim to have every service redundant so that if any server fails, we can fail over to a replica within a short amount of time, either automatically where possible, or manually where we think it’s better to have some human intervention first.

Overall, I believe that our continuous attempts to improve reliability have been working very well, and we always aim to learn from any problems and do better.