Real tales of cyberattack response and recovery are hard to come by because organizations are reluctant to share details for a host of legitimate reasons, not the least of which is the potential for negative financial fallout. However, if we never tell our stories we doom others to blindly walk our path. Better to share our real-world battlefield experiences and become contributors to improved threat intelligence.

We are a SaaS-based supplier of Web content management for mid- and large-size enterprises. Our customers manage hundreds, sometimes thousands of websites around the globe in high profile industries such as pharmaceuticals and financial services. The customer in this story prefers to remain anonymous, but to provide some context, the company is a large, public healthcare services organization focused on helping providers improve financial and operational performance. The company counts thousands of hospitals and healthcare providers as clients, managing billions in spending.

The scale of this particular DDoS attack was enormous – at its peak, 86 million concurrent connections were hitting our customer’s website from more than 100,000 hosts around the world. The FBI was called. When it was all over, 39 hours later, we had mounted a successful defense in what proved to be an epic battle. Here’s how it happened.

Initial Attack Vector

On the eve of the company’s annual conference where it was set to host 15,000 attendees, we received a troubling alert. The company’s web servers were receiving unbelievable amounts of traffic. The company is a SaaS provider of content and analytics for its clients, so this slowdown had the potential to dramatically impact service availability and reputation. There was no time to waste.

The initial attack vector had several notable characteristics:

* All of the requests were for 100% legitimate URLs, so we couldn't easily filter out malicious traffic.

* Attacks were originating from all around the world – including North Korea, Estonia, Lithuania, China, Russia and South America.

* 60% of the traffic was coming from inside the United States.

* The attack was dereferencing DNS – resolving the site’s name once – and attacking the IP addresses directly.
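That last characteristic leaves a fingerprint in access logs: a client that has dereferenced DNS sends a raw IP literal in its Host header instead of the site's hostname. A minimal, IPv4-oriented sketch of that check (the log parsing and field names are assumptions for illustration, not our actual tooling):

```python
import ipaddress
from collections import Counter

def count_direct_ip_hits(host_headers):
    """Count requests whose Host header is a raw IP literal -- a sign the
    client resolved DNS once and is hammering the address directly.
    host_headers: iterable of Host header values parsed from access logs.
    (IPv4-oriented sketch; bracketed IPv6 literals would need extra parsing.)
    """
    direct = Counter()
    for host in host_headers:
        try:
            # Strip an optional :port suffix, then test for an IP literal.
            ipaddress.ip_address(host.split(":")[0])
        except ValueError:
            continue  # a normal hostname -- not a direct-IP request
        direct[host] += 1
    return direct
```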

We were able to defend against this initial wave by going behind the scenes to our courtesy domain in Amazon’s Route 53, rearranging the records and immediately cutting off traffic to the targeted IP addresses. Things returned to normal and we breathed a sigh of relief, thinking, in our ignorance, that everything was going to be OK. As it turned out, that was only the first wave. Next came the tsunami.

That evening the attackers came back, and came back with a vengeance, targeting the site via its DNS name, which meant we couldn’t employ the same IP blocking tactic we had used earlier. Traffic shot up dramatically.

The existential question you ask yourself in moments like this is, “Are we going to lie down and die, or are we going to step it up?” This led to a seminal moment in conversation with the customer CIO, and we decided with a handshake to step it up. As SaaS companies, our ability to deliver continuous, reliable service is paramount, so both of our reputations were on the line. We agreed to share the cost -- potentially tens of thousands of dollars -- to fight the good fight.

Looking at the second wave’s traffic, we realized there were a few immediate mitigations that we could easily implement:

* This particular organization only serves U.S. customers, yet a lot of the traffic was coming from outside the country. We quickly implemented some firewall rules that had been battle tested from our work with Federal agencies, which would admit only U.S.-based traffic. This immediately stopped 40% of the traffic at the front door.
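The mechanics of such a front-door rule can be sketched in a few lines. The CIDR ranges below are documentation examples, not our actual rules, and real U.S.-only filtering relies on a GeoIP feed rather than a hand-built list:

```python
import ipaddress

# Hypothetical allow-list of CIDR blocks. In practice this would be
# generated from a GeoIP database of U.S.-allocated ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("198.51.100.0/24", "203.0.113.0/24")
]

def admit(source_ip):
    """Return True only if the source address falls inside an
    allow-listed range; everything else is dropped at the front door."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```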

* We inserted a web application firewall behind our AWS Route 53 configuration and scaled up some HA Proxy servers, which would gather a lot of logging information for the FBI – who had now become our new best friends – to analyze after the fact.

* Third, we intentionally broke our auto scaling configuration. Auto scaling has triggers for scaling up and scaling down; we adjusted the scale-down trigger so that its threshold would never be reached. The system would scale up properly as more traffic came in, but it would never scale back down. As a result, every instance that was launched stayed in service, leaving its logging information intact for harvesting by the FBI.

It was now 1:00 a.m. We put our game faces on. The arms race had begun.

DDoS Day Two

Our attackers scaled up. Amazon Web Services scaled up. Our attackers scaled up some more. AWS scaled up some more. This continued into Day Two. At this point we were providing hourly updates to our customer’s board of directors.

At the height of the DDoS attack, we had 18 very large, compute-intensive HA Proxy servers deployed and almost 40 large web servers. The web server farm was so large because – even though we had excluded the non-U.S. traffic component, representing 40% of the overall load – the remaining 60% consisted of legitimate URLs originating from within the United States, most of which were accessing dynamic services that could not easily be cached.

Traffic was hitting an extremely large, globally-distributed infrastructure. Our highly-scaled web server farm was deployed behind a very substantial HA Proxy firewall/load-balancer configuration. This in turn sat behind CloudFront, AWS’ globally-distributed content delivery network, which itself was deployed behind Route 53, AWS’ globally-redundant DNS platform. This was an infrastructure of very significant dimensions, scalable and secure at every tier.

At around 7 p.m. that evening, something fantastic happened. We scaled up… but the bad guys didn't scale up anymore. At this point we were sustaining 86 million concurrent connections from more than 100,000 hosts around the world. We measured the traffic and were shocked to see that we were handling 20 gigabits per second of sustained traffic through the AWS infrastructure, roughly 40 times the industry median observed in DDoS attacks in 2014, according to Arbor Networks. Through it all, we continued to serve the website at a response time of about 1-3 seconds per page.
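Some back-of-envelope arithmetic on those peak figures shows the shape of the attack: spread across 86 million connections, 20 Gbps works out to only a couple of hundred bits per second per connection, the signature of an enormous number of low-bandwidth bot connections rather than a few fat pipes:

```python
# Figures from the text; everything else is simple arithmetic.
peak_connections = 86_000_000   # concurrent connections at the peak
hosts = 100_000                 # attacking hosts observed
sustained_bps = 20e9            # 20 Gbit/s sustained through AWS

per_connection_bps = sustained_bps / peak_connections   # ~233 bits/s each
connections_per_host = peak_connections / hosts         # 860 per host
```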

Our attackers had run out of gas. They hammered us and hammered us until they simply gave up and went home. At the end, the company’s CIO told us that if they had hosted the site in their own data center, they would have been out of options and unable to respond a mere eight hours into the attack. Remember that handshake agreement to share the cost of defense? At the end of the day, the total cost in Amazon Web Services fees to successfully defend against this assault for 36 hours amounted to less than $1,500.

How to Prepare for a DDoS Attack

We survived because we were prepared, but this experience gave us additional insights for surviving a large-scale DDoS attack. Here are some things you can do to fortify your data center and protect your corporate website(s):

Design, configure and test your infrastructure to withstand DDoS attacks. Set up these tests with your hosting provider’s knowledge, and hopefully their assistance.

Be aware of what normal is in your environment and set alarms for when “normal” isn't happening.

Alias your public-facing domain name to an internal courtesy domain. This will allow you to respond very quickly and make DNS changes in real time, behind the scenes, without having to depend on third party service providers.
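With boto3, that kind of record change can be scripted ahead of time. This sketch only builds the ChangeBatch structure that `route53.change_resource_record_sets` expects; the domain names are hypothetical, and the low TTL is what lets a mid-incident change take effect quickly:

```python
def alias_change_batch(public_name, courtesy_name, ttl=60):
    """Build a Route 53 ChangeBatch that points a public-facing name at
    an internal courtesy domain via CNAME. This is the structure passed
    to boto3's route53.change_resource_record_sets (names hypothetical).
    A short TTL keeps the record agile during an incident.
    """
    return {
        "Changes": [{
            "Action": "UPSERT",  # create or overwrite in one call
            "ResourceRecordSet": {
                "Name": public_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": courtesy_name}],
            },
        }]
    }
```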

Learn (and practice) how to efficiently manage a DNS change in the middle of potentially challenging scenarios.

Traffic testing should spin up many multi-threaded requests from hundreds of sources in an attempt to exhaust server resources quickly. Run each test for at least three hours to verify you can sustain your response over time, and leave an adequate cool-down period between runs. Obtain explicit permission before mounting any significant test, or you risk suspension and/or cancellation of your service.
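A minimal harness for that kind of multi-threaded traffic test might look like the following. Here `send_request` is a placeholder for whatever client call hits your (authorized) test target, and the worker counts are far below what a real test would use:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(send_request, workers=200, requests_per_worker=50):
    """Fire many concurrent batches of requests and collect per-request
    latencies. send_request is a zero-argument callable that issues one
    HTTP request (a placeholder -- substitute e.g. an urllib.request
    call against your own test endpoint). Returns a list of latencies
    in seconds for later percentile analysis.
    """
    def worker(_):
        results = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            send_request()
            results.append(time.perf_counter() - start)
        return results

    latencies = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in pool.map(worker, range(workers)):
            latencies.extend(batch)
    return latencies
```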

When building auto scaling configurations, don’t use CPU load as a metric. Instead, the best evidence of DDoS is an increase in the number of inbound HTTP requests, so the best alarm to set is the “Network In” trigger.
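In CloudWatch terms, that means alarming on the EC2 `NetworkIn` metric for the auto scaling group. A sketch of the parameters passed to boto3's `cloudwatch.put_metric_alarm` (group name and threshold are illustrative; in practice you would also attach the scale-up policy ARN via `AlarmActions`):

```python
def network_in_alarm(asg_name, threshold_bytes, period=60, eval_periods=2):
    """Build CloudWatch alarm parameters for a scale-up trigger based on
    inbound network traffic rather than CPU load. These are the kwargs
    for boto3's cloudwatch.put_metric_alarm; names are illustrative.
    """
    return {
        "AlarmName": f"{asg_name}-network-in-high",
        "Namespace": "AWS/EC2",
        "MetricName": "NetworkIn",          # bytes received per period
        "Statistic": "Sum",
        "Period": period,                   # seconds per evaluation period
        "EvaluationPeriods": eval_periods,  # sustained, not a single spike
        "Threshold": threshold_bytes,
        "ComparisonOperator": "GreaterThanThreshold",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
    }
```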

Scale up quickly but decay slowly, at a 4:1 or 2:1 threshold ratio of scale-up to scale-down. This allows for very rapid response to an initial attack and will reduce the likelihood of having to scale up again, should your attackers return in multiple waves in a “hit and run” style.
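A toy simulation shows the effect of the asymmetric thresholds: with a 4:1 band, instances launched during a burst stay in service through the lull, ready for the next wave (the traffic numbers are arbitrary):

```python
def plan_scaling(samples, up_threshold, ratio=4):
    """Simulate asymmetric auto scaling thresholds (a sketch): scale up
    when traffic exceeds up_threshold, but scale down only when it falls
    below up_threshold / ratio -- a 4:1 hysteresis band. Returns the
    instance count after each traffic sample.
    """
    down_threshold = up_threshold / ratio
    instances = 1
    history = []
    for traffic in samples:
        if traffic > up_threshold:
            instances += 1          # rapid response to an attack wave
        elif traffic < down_threshold and instances > 1:
            instances -= 1          # slow decay once traffic truly subsides
        history.append(instances)
    return history
```

Note how the two mid-range samples below leave the fleet at full strength: a "hit and run" attacker returning during the lull meets capacity that is already in place.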

If using AWS Elastic Load Balancing, activate the “cross-zone load balancing” option. It is the best option for even distribution of traffic across your back-end server farm and significantly reduces load on your DNS infrastructure.
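For a classic Elastic Load Balancer, cross-zone balancing is a single attribute flip. This builds the arguments for boto3's `elb.modify_load_balancer_attributes` (the load balancer name is hypothetical):

```python
def cross_zone_kwargs(elb_name):
    """Kwargs for boto3's elb.modify_load_balancer_attributes enabling
    cross-zone load balancing on a classic ELB, so traffic is spread
    evenly across back-end instances in every availability zone.
    """
    return {
        "LoadBalancerName": elb_name,
        "LoadBalancerAttributes": {
            "CrossZoneLoadBalancing": {"Enabled": True},
        },
    }
```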

I firmly believe that together as an industry we need to collaborate on a better collective understanding of our adversaries’ tactics, techniques and procedures to stay one step ahead of the bad guys.