All the attacks we have suffered so far have been bulk traffic attacks - including another one the day after we posted that blog. Apart from a short delay as the full protections kick in, things keep working through the attacks now.

But we were immediately aware on the very first day of the attacks that a complexity attack, using legitimate-looking requests, could still cause us serious availability issues. Our preferred way to avoid being attacked is to avoid pissing people off, but in a world where all email providers are under attack, who knows what's next?

Rate limits

Like any mature internet service, we have resource usage limits per user and per IP address to prevent abuse or misconfiguration from doing damage to our service.

Everything is limited: the number of emails sent per day, the number of login attempts per day, total bandwidth use, even the amount of storage used on the server. Most of these limits are set so high that no "reasonable" user could hit them. Of course, a user with multiple IMAP clients all configured to do a full scan of every folder, with a login per folder, every minute, sure can hit them. There is software which does that, and we have to suggest that those users get better software or turn down the login rate, because it's indistinguishable from abuse to the rate limiter.
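A per-key limit like the ones described above is often implemented as a token bucket. Here is a minimal sketch; the rate and burst numbers are illustrative, not our actual limits:

```python
import time

class TokenBucket:
    """Simple per-key rate limiter; keys might be usernames or IPs."""
    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum bucket size
        self.buckets = {}     # key -> (tokens, last_timestamp)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[key] = (tokens, now)
            return False
        self.buckets[key] = (tokens - 1, now)
        return True

# Illustrative policy: 10 logins per minute per IP, bursts of 5.
limiter = TokenBucket(rate=10 / 60, burst=5)
results = [limiter.allow("192.0.2.1", now=100.0) for _ in range(6)]
print(results)  # first five attempts allowed, sixth denied
```

The refill-on-read design means idle keys cost nothing between requests, which matters when you track millions of IPs.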

Costly operations

Logins are expensive. Deliberately so. We use bcrypt to hash user credentials so that even if our login database were stolen, it would be prohibitively expensive to reverse engineer the passwords. This is security best practice.

It also means that checking if a password is correct takes a lot of CPU. We rate limit login attempts per IP, but a botnet could eat all our CPU pretty fast just testing passwords if it knows a ton of login names.
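To make the cost concrete, here is a sketch of a deliberately expensive password check. We use bcrypt in production; the standard library's scrypt stands in here so the example is self-contained, and the cost parameters are illustrative:

```python
import hashlib
import hmac
import os
import time

# scrypt with these parameters burns real CPU and memory per call,
# which is the point: brute-forcing stolen hashes becomes impractical,
# but every legitimate login check pays the same price.
def hash_password(password: bytes, salt: bytes) -> bytes:
    return hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1)

def check_password(password: bytes, salt: bytes, stored: bytes) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(hash_password(password, salt), stored)

salt = os.urandom(16)
stored = hash_password(b"hunter2", salt)
start = time.monotonic()
ok = check_password(b"hunter2", salt, stored)
print(ok, f"{time.monotonic() - start:.3f}s per check")
```

An attacker who can trigger thousands of these checks per second, across a botnet, is spending our CPU, not theirs.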

Email addresses are login names, and email addresses are generally widely known. Even if we wanted to switch users to a secret login name instead, it would take time to migrate everyone over.

Similarly, if you have IMAP access or even a logged in web account, it's possible to use a lot of IO resources. Even with per-process resource limits in place, IO is hard enough to account for accurately that a few bad eggs can still consume significant resources.

Degrade gracefully

While none of these attacks, attempted login or IO complexity, can give an attacker access to another user's mailbox, they can be used to lock users out. If our servers are overloaded dealing with malicious requests, they won't be able to give timely access to benign requests.

So we added another component to our existing rate limiting and abuse prevention arsenal, and we called it the zebra. I was on call on Sunday when the initial DDoS hit, and I had an initial prototype and research finished by Sunday night.

It was built and tested on the Monday and deployed into production on Tuesday of DDoS week. It's one of the fastest turnarounds we've ever had from initial concept to major production component.

White and black lists

We called it zebra as a joke around black and white lists/stripes. The concept is pretty simple. There's a central server which collects data from all our hosts about IP addresses which have been good or bad. We already had a naughty list for MX hosts, which earns them a "talk to the hand" response for a while if they try to send us any email; it's populated entirely with servers that have been bad to our servers specifically.

I knew the bad list was about a million IPs, and suspected that a good list would be about the same size. Rob M didn't think guessing was good enough, so I scanned the log files for the past week. I found 989,000 candidates. I call that a pretty good guesstimate.

We all took a component, defined a protocol, and built our parts. Data collectors, the central server, and an edge node daemon.

IPSet

We run Linux, and we use iptables on our firewalls. We added an ipset rule, which allows us to filter on large numbers of IP addresses at once. The zebraclient listens for broadcasts from the central server, and updates the ipset with new IPs and their timeouts.
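The update loop could look something like this sketch. The set name, timeouts, and addresses are illustrative rather than our real configuration, and the commands are collected as strings instead of being executed so the example runs anywhere:

```python
# Sketch of how a client like zebraclient might maintain an ipset.
def ipset_commands(setname, updates, create=False):
    cmds = []
    if create:
        # A hash:ip set with a default timeout; entries expire on their own.
        cmds.append(f"ipset create {setname} hash:ip timeout 3600")
    for ip, timeout in updates:
        # 'add ... -exist' refreshes the timeout if the IP is already present,
        # which is how active IPs stay on the list between broadcasts.
        cmds.append(f"ipset add {setname} {ip} timeout {timeout} -exist")
    return cmds

cmds = ipset_commands("zebra-blacklist",
                      [("198.51.100.7", 600), ("203.0.113.9", 1800)],
                      create=True)
print("\n".join(cmds))
```

A matching iptables rule (`-m set --match-set zebra-blacklist src -j DROP`) then drops those addresses with a single rule, no matter how many IPs the set holds.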

Every entry has a timeout after which the IP will expire from the list if it isn't refreshed by the zebraclient, so it's a self-cleaning system.

Zebra hooks have been inserted throughout our system, and different behaviours can cause an IP to be whitelisted, moved down to a less trusted list, forgotten, or even blacklisted which will cause it to be dropped at the boundary by all our hosts.

We use unicast UDP within the datacentre from the collection agents to the zebraserver, and broadcast from the zebraserver to the zebraclients. Our internal network is very reliable, particularly now that we're running 10 gigabit interlinks between the racks (since about 6 months ago, when we had an outage from overloading the 1 gigabit links with a massive backup migration). Active IPs are re-announced quite frequently, so UDP's theoretical unreliability isn't a problem in practice.
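A minimal sketch of that fire-and-forget UDP scheme, assuming a simple one-line "IP timeout" wire format (the real protocol isn't described here). The demo uses loopback unicast so it runs anywhere; real LAN broadcast only adds a socket option, noted in a comment:

```python
import socket

def send_update(sock, addr, ip, timeout):
    # Small datagrams, no acknowledgement; loss is tolerated because
    # active IPs are re-announced frequently.
    sock.sendto(f"{ip} {timeout}".encode(), addr)

# Listener standing in for a zebraclient, on an ephemeral loopback port.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
addr = recv.getsockname()

send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# For real LAN broadcast from the central server you would also do:
# send_sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
send_update(send_sock, addr, "203.0.113.9", 1800)

data, _ = recv.recvfrom(1024)
ip, timeout = data.decode().split()
print(ip, timeout)  # 203.0.113.9 1800
```

Broadcast keeps the server's work constant no matter how many edge nodes are listening, which is exactly what you want when an attack is already stressing the fleet.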

Shields up!

A single command can be run by our on-call admin if we find ourselves under a complexity attack. It will cause us to stop talking to anyone who's not in the whitelist. We tested it on our beta server for a couple of hours and watched to make sure that existing users could keep chatting to it. We tried from a couple of untrusted machines (test phones, a cafe's free wifi) and confirmed that they couldn't get to the beta host on any service.
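A hedged sketch of what such a switch could look like with iptables plus an ipset whitelist. The chain layout and set name are illustrative, and the rules are returned as strings rather than executed:

```python
# "Shields up": drop everything not in the whitelist ipset.
def shields_up_rules(whitelist_set="zebra-whitelist"):
    return [
        # Keep established connections alive so current users aren't cut off.
        "iptables -I INPUT 1 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
        # Accept anyone whose source address is in the whitelist set.
        f"iptables -I INPUT 2 -m set --match-set {whitelist_set} src -j ACCEPT",
        # Everyone else is dropped at the boundary.
        "iptables -I INPUT 3 -j DROP",
    ]

for rule in shields_up_rules():
    print(rule)
```

Because the whitelist is a pre-curated ipset, flipping this on is O(1) work at attack time; all the expensive curation happened earlier, in peacetime.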

Knowing that the zebra is there, and keeping its lists up to date, makes us happy. It means if we do get hit with a targeted attack, we have a playbook already, and aren't trying to write tooling under pressure.

It was also really nice to work together as a team that quickly to deliver a major new component to our architecture and have it running in production within a day.

The whole stack

There are great benefits to owning our whole stack. We've built Linux kernel modules. We spec out the hardware that our servers run on. We contribute most of the work that goes into the open source Cyrus IMAP server.

This means we can change things really quickly. It's amazing to think how much changes, and sometimes how quickly. The reverse ACLs change was in production for all users within 2 days of writing it.

We got the spec for Apple push events and 3 days later we turned it on for all users - it would have been quicker, but half our team were in Sunnyvale and the other half were back in Australia at the time, so some time was lost waiting for people to be awake. Also, Rob N insisted on writing the first pass himself without me sticking my nose in and helping - so he had to learn some bits of Cyrus that aren't very cleaned up yet.

And knowing all our tools meant we could grow a zebra in a vat, pre-feed it the data I'd collected during Sunday night's investigation, and then keep feeding it live trustworthiness info about IP addresses so it can curate lists ready for an emergency. Meanwhile, we keep statistics on the effectiveness of those lists, so we know how bad the impact would be if we went full whitelist, or with a blended response.