However, this article left a bad taste in my mouth and ultimately invites more questions than it answers. Frankly, I felt there was a large amount of hand-waving in DeSantis’ points that glossed over some very important recent security issues.

DeSantis’ remarks implied, per the title of the article, that the customer is to blame — thanks to hype and overly aggressive, misaligned expectations — for the poor handling of, and AWS’ continuing lack of transparency around, the issues people like me raise.

In short, it’s not AWS’ fault they’re so awesome, it’s ours. However, please don’t remind them they said that when they don’t live up to the hype they help perpetuate.

I’m going to skip around the article because I do agree with Peter DeSantis on the points he made about the value proposition of AWS which ultimately appear at the end of the article:

“A customer can come into EC2 today and if they have a website that’s designed in a way that’s horizontally scalable, they can run that thing on a single instance; they can use [CloudWatch] to monitor the various resource constraints and the performance of their site overall; they can use that data with our autoscaling service to automatically scale the number of hosts up or down based on demand so they don’t have to run those things 24/7; they can use our Elastic Load Balancer service to scale the traffic coming into their service and only deliver valid requests.”

“All of which can be done self-service, without talking to anybody, without provisioning large amounts of capacity, without committing to large bandwidth contracts, without reserving large amounts of space in a co-lo facility and to me, that’s a tremendously compelling story over what could be done a couple years ago.”

Completely fair. Excellent way of communicating the AWS value proposition. I totally agree. Let’s keep this definition firmly in mind as we go on.
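To make that value proposition concrete, the scale-with-demand loop DeSantis describes boils down to logic like the following. This is a minimal sketch of the idea, not the actual CloudWatch/Auto Scaling API; the thresholds and function are hypothetical:

```python
# Hypothetical autoscaling decision logic -- illustrative only, not the AWS API.
def desired_instance_count(current, cpu_utilization, low=0.30, high=0.70,
                           min_instances=1, max_instances=20):
    """Scale out when average CPU is hot, scale in when it's idle."""
    if cpu_utilization > high:
        return min(current * 2, max_instances)   # double under load
    if cpu_utilization < low:
        return max(current // 2, min_instances)  # halve when idle
    return current                               # steady state

# A horizontally scalable site rides demand up and back down:
print(desired_instance_count(4, 0.85))  # hot: scales out to 8
print(desired_instance_count(8, 0.10))  # idle: scales back to 4
```

The point is exactly what DeSantis claims: no capacity planning, no bandwidth contracts, just a metric and a policy.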

Here’s where the story turns into something like a confessional that implies AWS is sadly a victim of their own success:

DeSantis said that the reason for stories like the DDoS on Bitbucket.org (and the non-cloud Sidekick story) is that people have come to expect always-on, easily consumable services.

“People’s expectations have been raised in terms of what they can do with something like EC2. I think people rightfully look at the potential of an environment like this and see the tools, the multi-availability zone, the large inbound transit, the ability to scale out and up and fundamentally assume things should be better,” he said.

That’s absolutely true. We look at what you offer (and how you offered/described it above) and we set our expectations accordingly.

We do assume that things should be better as that’s how AWS has consistently marketed the service.

You can’t reasonably “sell” the service one way and then, when something negative happens, turn around and suggest it’s the consumers’ fault for setting their expectational compass by the very course you charted.

It *is* absolutely fair to say that common sense and good architectural logic still apply when deploying services on AWS, but it’s also disingenuous to expect that much of the target market you’re selling to understands the caveats when so much is obfuscated by design. I understand AWS doesn’t claim to protect against every threat, but they also don’t say which threats they don’t protect against…until something happens that makes it readily apparent 😉

When everything is great AWS doesn’t go around reminding people that bad things can happen, but when bad things happen it’s because of incorrectly-set expectations?

For instance, DeSantis said it would be trivial to wash out standard DDOS attacks by using clustered server instances in different availability zones.

Okay, but four things come to mind:

1) Why did it take 15 hours for AWS to recognize the DDoS in the first place? (They didn’t actually “detect” it; the customer did.)

2) Why did the “vulnerability” continue to exist for days afterward?

3) While using different availability zones makes sense, it’s been suggested that this DDoS attack was internal to EC2, not externally generated.

4) While it *is* good practice and *does* make sense, “clustered server instances in different availability zones” costs money.

Keep those things in the back of your mind for a moment…

“One of the best defenses against any sort of unanticipated spike is simply having available bandwidth. We have a tremendous amount of inbound transit to each of our regions. We have multiple regions which are geographically distributed and connected to the internet in different ways. As a result of that it doesn’t really take too many instances to have a tremendous amount of availability – 2, 3, 4 instances can really start getting you up to where you can handle 2, 3, 4, 5 Gigabits per second. Twenty instances is a phenomenal amount of bandwidth transit for a customer,” he said.

So again, here’s where I take issue with this “bandwidth solves all” answer. The solution being proposed by DeSantis here is that a customer should be prepared to launch/scale multiple instances in response to a DoS/DDoS, in effect making it the customers’ problem instead of AWS detecting and squelching it in the first place?

Further, when you think of it, the trickle-down effect of DDoS is potentially good for AWS’ business. If they can absorb massive amounts of traffic, then the more instances you have to scale, the better for them given how they charge. Also, per my point #3 above, it looks as though the attack was INTERNAL to EC2, so ingress transit bandwidth per region might not have done anything to help here. It’s unclear to me whether this was a distributed DoS attack at all.

Lori MacVittie wrote a great post on this very thing titled “Putting a Price on Uptime” which basically asks who pays for the results of an attack like this:

“A lack of ability in the cloud to distinguish illegitimate from legitimate requests could lead to unanticipated costs in the wake of an attack. How do you put a price on uptime and more importantly, who should pay for it?“

This is exactly the point I was raising when I first spoke of Economic Denial Of Sustainability (EDoS) here. All the things AWS speaks to as solutions cost more money…money which many customers, based upon their expectations of AWS’ service, may be unprepared to spend. They wouldn’t have much better options (if any) if they were hosting somewhere else, but that’s hardly the point.
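The EDoS math is easy to sketch on the back of an envelope. The rates below are made-up placeholders, not actual AWS pricing; the shape of the bill is what matters:

```python
# Illustrative EDoS cost estimate -- every rate here is a hypothetical placeholder.
def edos_cost(attack_gbps, hours, per_gb_transfer=0.10, per_instance_hour=0.10,
              gbps_per_instance=1.0):
    """What the *victim* pays to absorb an attack on a metered cloud."""
    gb_transferred = attack_gbps / 8 * 3600 * hours   # Gb/s -> GB over the attack
    instances = attack_gbps / gbps_per_instance       # instances needed to soak it up
    return (gb_transferred * per_gb_transfer
            + instances * per_instance_hour * hours)

# A 2 Gb/s flood absorbed for the 15 hours Bitbucket was down:
print(round(edos_cost(2, 15), 2))  # -> 1353.0 (under these placeholder rates)
```

Note where the money goes: almost all of it is metered transfer for traffic the customer never wanted.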

I quote back to something I tweeted earlier “The beauty of cloud and infinite scale is that you get the benefits of infinite FAIL”

The largest DDoS attacks now exceed 40 Gbps. DeSantis wouldn’t say what AWS’s bandwidth ceiling was but indicated that a shrewd guesser could look at current bandwidth and hosting costs and what AWS made available, and make a good guess.

The tests done here showed the capability to generate 650 Mbps from a single medium instance attacking another instance which, per Radim Marek, was using another AWS account in another availability zone. So if the “largest” DDoS attacks now exceed 40 Gbps and five EC2 instances can handle 5 Gb/s, I’d need eight times that (40 instances) to absorb an attack of this scale (unknown whether this assumes small or large instances). Seems simple, right?
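A quick sanity check of that arithmetic, assuming roughly 1 Gb/s absorbed per instance as the figures above suggest (an assumption on my part; AWS publishes no such number):

```python
import math

# Assumed absorption rate per instance -- inferred from "5 instances ~ 5 Gb/s",
# not an AWS-published figure.
def instances_to_absorb(attack_gbps, gbps_per_instance=1.0):
    """How many instances it takes to soak up an attack of a given size."""
    return math.ceil(attack_gbps / gbps_per_instance)

print(instances_to_absorb(5))   # the 5 Gb/s case: 5 instances
print(instances_to_absorb(40))  # the "largest" attacks: 40 instances
```

All of which you, the customer, pay for by the hour while the attack runs.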

Again, this is about absorbing bandwidth during these attacks, not preventing or defending against them. This is not only about passing the buck, but about squeezing more bucks out of you, the customer.

“I don’t want to challenge anyone out there, but we are [a] very, very large environment and I think there’s a lot of data out there that will help you make that case,” he said.

Of course you wish to challenge people, that’s the whole point of your arguments, Peter.

How much bandwidth AWS has is only one part of the issue here. The other is AWS’ ability to respond to such attacks in reasonable timeframes and prevent them in the first place as part of the service. That’s a huge part of what I expect from a cloud service.

Interesting chat I had yesterday with a vendor – he said, and I paraphrase, that the economic downturn had reduced their triers but actually increased their buyers. He said his prospects were actually doing their homework, researching his services, analyzing ROI (I know, strange idea, right?) and not coming on board if it didn't add up.

"Two years ago, that was something we never saw," he said. He said the pain of disappearing prospects was somewhat offset by the fact that the ones who did buy were pretty pleased; they knew the service would add up for them. So: less short-term growth, more long-term retention.

Point being: what's it going to take for AWS users to start doing their homework? Or will they always fall for the easy-peasy "try it for $0.10" pitch that AWS has? And does that self-serve, all-you-can-eat model mean that Amazon has MORE responsibility (that it is not accepting) to take care of its doltish users?

I mean, you can't charge for the ride and just shrug when a kid falls off, right?

Finally! At least a small glimpse of what 'the cloud' really is: an environment (servers, applications, networks, load-balancers) that enables a website (a horizontally scalable application) to grow/shrink on demand — or I guess I should say 'with demand' — to provide a usable application experience…by configuring the environment beforehand.

As for the EDoS: in a perfect world, the attack would be detected in less than a tenth of a second, and ingress filtering would be applied, up to and including dropping carrier on the offending port. Needless to say, there isn't enough 1) trust, 2) communication, or 3) established procedure between ISPs and backbone providers to prevent these massive EDoS/DDoS attacks from continuing much longer than necessary.

If I have a connection to MyISP, Inc., I should be able to log in to my account, upload a packet trace in tcpdump format, and click Report DoS. Their system should then:

1) verify my identity/affiliation;
2) verify that the dest-ip in the packet is registered to me and that they are the last-hop route;
3) verify that the attack is still in progress;
4) use their internal routing information to determine where this packet is entering their network;
5) confirm that it is still entering their network at that location;
6) transmit, via API, the information to the upstream provider; and repeat steps 3-6 until
7) the packets are tracked to their source ISP (whether or not the traffic IS supposed to ingress from that port is a different matter).

At this point, the source ISP has many options, from doing nothing all the way up to pulling the plug on the client. A logically simple, but probably technically complex, solution would be for ISPs to have a DoSBlock ingress rule: with the simple addition of source and destination addresses to the rule, the DoS, while still ongoing as far as the source is concerned, has been stopped as far as the target (and intermediate network operators) are concerned.
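The iterative trace-back in steps 3-6 above is just a walk up the provider chain. Everything in this sketch is hypothetical — no such inter-ISP API exists today, which is exactly the complaint — with a static upstream map standing in for each provider's internal routing lookup:

```python
# Hypothetical inter-ISP DoS trace-back. No such API exists; a static upstream
# map stands in for each provider's "where is this traffic entering?" lookup.
UPSTREAM = {
    "MyISP": "RegionalCarrier",       # my last-hop provider
    "RegionalCarrier": "BackboneInc",
    "BackboneInc": "SourceISP",
    "SourceISP": None,                # the attacker's own provider: stop here
}

def trace_attack(first_hop):
    """Walk upstream provider-by-provider until the source ISP is reached."""
    path = [first_hop]
    while UPSTREAM[path[-1]] is not None:
        path.append(UPSTREAM[path[-1]])
    return path  # the last entry is the ISP that can pull the plug

print(trace_attack("MyISP"))  # ['MyISP', 'RegionalCarrier', 'BackboneInc', 'SourceISP']
```

The hard part isn't the loop; it's the trust, authentication, and liability questions at every hop.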

Let's look at this from a different perspective: 15 hours. After 15 hours, how many of the DDoS source machines were identified? 1? 2? All of them? Actually, I'd be interested to know if it is even a non-zero number. If ISPs want free rein to packet-shape BitTorrent down to 10k and give Skype priority, will they also become RESPONSIBLE for throttling DoS attacks to 0k?