Archive for the ‘Cloud Computing’ Category

One of our hosting providers, Linode, co-locates their servers in Hurricane Electric‘s (HE) Fremont data center. Earlier today HE got hit by a massive (no other word really describes it) DDoS. It started at around 1:30AM GMT+8 and ended around 10AM (at least for us, with a couple of servers affected). Linode posted this update from HE:

On October 3rd we experienced a large attack against multiple core routers on a scale and in ways not previously done against us. We had various forms of attack mitigation already in place, we have added more. It was all fixable in the end, just the size and number of routers getting attacked and the figuring out what attacks were doing what to what took some time. The attack mitigation techniques we’ve added will be left in place. We are continuing to add additional layers of security to increase the resiliency of the network.

Because the attackers were changing their methods and watching how their attacks were responded to, we are not at liberty to elaborate on the nature of the security precautions taken.

This attack is interesting for a couple of reasons:

The core routers were the target. A typical DDoS targets a specific domain or service; by targeting HE’s routers, the impact of the attack was much broader, i.e. it affected all the customers of the data center. When Amazon was attacked, users hardly felt any degradation in performance. That’s because the attack was against a domain, and we already know Amazon has thousands of load-balanced servers that regularly take on the load of last-minute shopping. This one was different: instead of attacking the servers, they attacked the core routers and switches that act as ‘gateways’ to the load balancers, firewalls and servers. A core or edge router provides gateway routing and connectivity for dozens of other routers and possibly thousands of servers to the rest of the Internet; shut that router down and you’ve effectively made those thousands of servers inaccessible. The attack targeted “multiple” core routers at HE.
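To see why a router makes such an efficient target, here’s a toy sketch (the topology and node names are entirely made up, not HE’s actual network): model the data center as a graph and count how many servers lose Internet reachability when a single core router goes down.

```python
from collections import deque

# Hypothetical topology: the Internet reaches one core router, which fans
# out to two edge routers, which fan out to a thousand servers.
links = {
    "internet": {"core1"},
    "core1": {"edge1", "edge2"},
    "edge1": {f"server{i}" for i in range(1, 501)},
    "edge2": {f"server{i}" for i in range(501, 1001)},
}

def reachable(links, start, down=frozenset()):
    """BFS from `start`, skipping any node listed in `down`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in links.get(node, ()):
            if nbr not in seen and nbr not in down:
                seen.add(nbr)
                queue.append(nbr)
    return seen

all_up = reachable(links, "internet")
core_down = reachable(links, "internet", down={"core1"})
print(len(all_up) - 1)     # nodes reachable with everything healthy: 1003
print(len(core_down) - 1)  # with core1 down: 0 -- the whole DC vanishes
```

One dead router, and every server behind it is unreachable, without a single server being touched.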

The attack was successful. New-generation routers usually have anti-DoS features built in; the fact that these were all overwhelmed means that a) the volume was simply massive (it’s not really difficult to congest a pipe) and/or b) a protocol exploit that burned a lot of CPU was used (BGP, for example, is a frequent target).

The attack was dynamic. HE mentioned that the attackers were changing their methods in real time and watching how their attacks were being responded to. Obviously, HE was not dealing with script kiddies here.

I can think of several scenarios for why somebody would do this (conspiracy hat firmly in place):

it’s a red herring: there really was a target, hosted by HE or one of its customers, but the perpetrators wished to hide that fact or;

somebody has an axe to grind with HE. It could be a disgruntled network engineer; it can happen or;

it’s a proof-of-concept test, and this is the real concern. The attackers have obviously figured out a way to execute the attack dynamically and massively, and considering that it took one of the most experienced (and jaded) data center operators almost 12 hours to stop the attack, something new was done. One could argue that the attack stopped not because HE was able to adapt to the attack patterns (remember, HE said it was evolving), but because the attackers simply decided to stop, i.e. they could have continued if they wanted, and HE would have had to find yet another attack pattern to apply rules against.

Whoever it is, and we’ll probably never know who he/she/they are, this is a very real concern, especially if you’re in the business of hosting and service provisioning online. Unfortunately, if this happened to HE, it can happen to anybody.

update 2:48PM GMT+8: well, it looks like the attacker(s) simply went out to grab dinner. Linode is reporting that they’re experiencing more ‘stability issues’ with their Fremont (i.e. HE) ‘upstream’.

update 3:31PM GMT+8: the network has stabilized ‘again’ according to Linode. A quick clarification: apparently it’s not only HE’s Fremont facility that was affected but their NY data center as well. A take-no-prisoners approach, I see.

update 6:43PM GMT+8: apparently a similar attack, albeit a limited one, hit HE a week ago. Just a probe then; today was D-Day.

On September 28, 2011 10:20pm PDT and September 29, 2011 11:45am PDT, the Fremont 1 datacenter was subject to a DDOS targeting a core router. The attack caused OSPF and BGP reloads resulting in elevated CPU utilization and performance degradation of the router.

The incident on September 28, 2011 10:20pm PDT was identified and mitigated at 10:40pm PDT. The incident on September 29, 2011 11:45am was identified and partial mitigation was realized shortly thereafter with full containment at approximately 12:45pm PDT. All systems are fully operational at this time. We have already been in contact with the router vendor, and have obtained a new software image that addresses this type of infrastructure attack. We will be deploying the new image shortly. A maintenance notification will be sent out separately regarding this emergency maintenance.

Amazon just launched its tablet, the Kindle Fire. Aside from the price (it’s only $199, less than half of an iPad 2), one of its most interesting features is the browser, Amazon Silk. The browser basically off-loads the heavy lifting of rendering and image optimization to Amazon’s huge proxy/rendering farm (courtesy of AWS). The result is snappier pages and happier users.

At least that’s the idea. There’s no doubt that, infrastructure-wise, this will work: ISPs have done this at some point to save bandwidth and improve user experience (Squid being the most popular open-source cache/proxy).

However, it seems Amazon’s engineers have pushed caching to the next level by rendering CPU-hogging JavaScript and optimizing content (image resizing, mainly) prior to delivery to the Kindle’s Silk browser. So far so good.
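None of Silk’s internals are public, but the basic split-browser idea can be sketched as a caching proxy that does the heavy fetch-and-optimize work once, server-side, and serves every later request from cache (all names and numbers below are made up for illustration):

```python
class OptimizingProxy:
    """Toy stand-in for a Silk-style rendering/caching tier (hypothetical)."""

    def __init__(self, fetch, optimize):
        self.fetch = fetch        # downloads the original resource
        self.optimize = optimize  # e.g. resize images, pre-render JS
        self.cache = {}

    def get(self, url):
        if url not in self.cache:            # cache miss: do the heavy
            raw = self.fetch(url)            # lifting once, server-side
            self.cache[url] = self.optimize(raw)
        return self.cache[url]               # hits are served instantly

# Usage with stub fetch/optimize functions:
fetched = []
def fake_fetch(url):
    fetched.append(url)
    return b"X" * 1_000_000                  # a bloated 1 MB "image"

proxy = OptimizingProxy(fake_fetch, lambda raw: raw[:10_000])  # shrink 100x
first = proxy.get("http://example.com/big.jpg")
second = proxy.get("http://example.com/big.jpg")
print(len(first), len(fetched))  # optimized size; origin fetched only once
```

The user gets a 10 KB payload instead of 1 MB, and the origin server is hit once no matter how many Kindles request the page.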

Now for the privacy questions: how can Amazon guarantee a) the protection and b) the anonymity of session information and, most importantly, of the data (e.g. usernames and passwords) that will be “proxied” by its servers?

How will the browser deal with HTTPS traffic? Will that be ‘optimized’ too (i.e. go through their servers)? I hope not!

That being said, I’m looking forward to getting my hands on them fondleslabs =)

I just read a mailed letter, which had been languishing for more than a month on my desk, sent April 7th from GoGrid notifying me that they’ve had a security breach (the letter didn’t say when) and that my card information may have been viewed by a hacker. So much for their “Security and Compliance in the Cloud” claim.

In a nutshell, they are saying that an engineer made a routing error during some maintenance at 1am (reminds me of the FAA’s air controller problems), misdirecting the traffic EBS uses for replication. When the EBS nodes could not replicate, they assumed their backup peers were down and triggered an alternative backup mechanism, which added to the load, driving it higher still and causing other EBS nodes to begin backing up in panic mode. That’s the explanation.

Sounds logical enough, but I don’t buy it. Here’s why:

1) Any network operator worth their salt would have at least two routes for redundancy, with some kind of IGP for automatic failover and load balancing. I can’t believe that, with all that talk of high reliability, they had only one link for replication and one so-called “control plane”, whatever that means.

2) I don’t see how a ‘routing issue’ could lose an EBS volume (i.e. render it unrecoverable), unless a high load can somehow cause a hard drive to fizzle out or write bad blocks. Very unlikely.

3) The fact that it took them more than three days to restore volumes shows that it is hardware-related, meaning the primary devices, inexplicably rendered dead by a barrage of packets, had to be restored from backups. I wonder if the 0.07% of unrecoverable volumes were created just before or during the outage.

4) Then there’s the deafening silence of all the AWS evangelists, Jeff Bezos, and all the other paid AWS bloggers. It’s not that everybody is singing the same tune: nobody’s singing at all!
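Point 1 can be illustrated with a toy route-selection sketch (link names and costs are hypothetical, and this is a cartoon of IGP behaviour, not any real router’s logic): with two links and a cost metric, losing the primary should simply shift traffic to the backup rather than black-holing it.

```python
# Toy IGP-style next-hop selection: always pick the lowest-cost route
# that is still up. With two links, a single failure is survivable.
routes = [
    {"via": "link-A", "cost": 10, "up": True},   # primary
    {"via": "link-B", "cost": 20, "up": True},   # backup
]

def best_route(routes):
    live = [r for r in routes if r["up"]]
    if not live:
        raise RuntimeError("no route to host")   # only now is traffic lost
    return min(live, key=lambda r: r["cost"])

print(best_route(routes)["via"])  # "link-A" while the primary is healthy
routes[0]["up"] = False           # primary link fails
print(best_route(routes)["via"])  # traffic fails over to "link-B"
```

With a single replication link, the equivalent of that first failure goes straight to “no route to host”, which is exactly what the official explanation implies happened.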

In conclusion, I don’t think this was a network-event-triggered disaster at all, but an inherent infrastructure/hardware design failure. For the sake of the many people who believe in AWS, I hope AWS learned its lessons really well and will make the necessary “fixes”.

Seven days and still no post mortem as promised, and it turns out the so-called “network event” was actually a hardware failure resulting in some 0.07% of EBS volumes being lost. Now they are talking about “the hardware” in the singular. What happened to the high-reliability claims? All snake oil?

Here’s their letter of apology (via Business Insider):
————————–
Hello,

A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful. The hardware failed in such a way that we could not forensically restore the data.

What we were able to recover has been made available via a snapshot, although the data is in such a state that it may have little to no utility…

If you have no need for this snapshot, please delete it to avoid incurring storage charges.

Well, it’s official: after 78 hours Amazon has declared the emergency over:

7:35 PM PDT As we posted last night, EBS is now operating normally for all APIs and recovered EBS volumes. The vast majority of affected volumes have now been recovered. We’re in the process of contacting a limited number of customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes.

If you believe you are still having issues related to this event and we have not contacted you tonight, please contact us here. In the “Service” field, please select Amazon Elastic Compute Cloud. In the description field, please list the instance and volume IDs and describe the issue you’re experiencing.

We are digging deeply into the root causes of this event and will post a detailed post mortem.

With the dearth of information from AWS, and going by past experience, we can only hazard a guess at how the AWS infrastructure looks. However, the Great AWS Outage (not completely recovered from as of this writing) gives us some ideas. This is a snapshot of the outage so far, showing that the majority of the API services were affected.

Pending the official post-mortem, here are a couple of possibilities:

All these services run on the same public EBS layer, and when that failed they all failed. This is the most likely explanation, but how does it account for the Elastic Beanstalk API failing as well (which does not seem to be region-centric)? There could also be a connection with the EBS failure on the 19th, which resulted in a much bigger problem two days later. From the status page (EC2 N. Virginia):

[RESOLVED]Increased error rates for Instance Import APIs in US-EAST-1

4:38 AM PDT Between 02:55 am and 04:20 am PDT, the Instance Import APIs in the US-EAST-1 Region experienced increased error rates. The issue has been resolved and the service is operating normally.

The US-EAST API infrastructure failed. Call it the Battle: Los Angeles scenario, where the weakest link of the invading aliens just happened to be their Command and Control (C&C). (The movie sucks, by the way, and that bunch of aliens obviously hadn’t learned the lessons of their cousins from ‘V‘ and ‘Independence Day‘.) This would explain Beanstalk failing as well, and it would mean the API infrastructure is not replicated to US-WEST or anywhere else.

Too bad; we were seriously considering migrating our databases to RDS and were just waiting for the beta bugs to be weeded out. What a relief.

OK, this will be my last post about the Great AWS Outage which, judging from the posts in the support forums, still seems to be affecting a substantial number of users. Not quite the rosy picture AWS posted on their status board:

Apr 24, 2:06 PM PDT We continue to make steady progress on recovering the remaining affected EBS volumes. We are now working on reaching out directly to the small set of customers with one of the remaining volumes yet to be restored.

We’re getting better support and uptime with our lone Linode box. I suspect we’ll be moving some servers there after this disaster.

Well, it looks like the Great AWS Outage is now in its third day. Despite the change of status from red to yellow on their status site, the east availability zone is still not 100% up and running. For many of us this is one big pin prick to our cloud balloon, specifically the one that says ‘AWS’ on it.

The news is now being picked up everywhere and I don’t think they will be able to just sweep this under the rug.

The confidence of the majority of AWS users has been shaken to the core! I don’t mean to sound melodramatic, but I would have been screaming for blood if we (in the west zone) had been hit as well. As for this post’s title…

As of this writing, AWS has recovered enough EBS volumes from their backups (S3, I suppose) from the previous day. This can only mean that the ‘networking event’ they were referring to was actually some disastrous component/hardware failure. AWS users who were using EBS as their primary data store (it’s supposed to be replicated within an availability zone, which translates to being backed up in a single data center) were particularly affected. When we began using AWS a couple of years ago, we also got burned by EBS (Elastic B*** S***). EBS I/O latency is high and reliability is low; after a couple of ‘events‘ (to borrow AWS’ favorite term) we realized EBS is no good for high-I/O applications (e.g. databases).

Hence our current (and soon-to-be-expanded) strategy:

1) Use older AMIs, which default to activating the instance’s disk (the one that is physically part of the machine).

2) Use the instance’s disk for high-I/O application storage (e.g. the database), replicate to another instance, and back up everything (in our case including the ‘hosts’ file) several times a day to an attached EBS volume. EBS volumes “are automatically replicated on the backend (in a single Availability Zone)“. However, as experience has shown us (annually, it now seems), this is not enough, so…

3) Use S3 snapshots to back up the EBS volume in item 2. This way you have a third backup. Snapshots are “automatically replicated across multiple Availability Zones”, which means you’ll have a copy of your backups in different data centers within a region.
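The three tiers above can be sketched as a dry-run backup planner (the host, volume, and bucket names are placeholders; the real thing would shell out to replication and snapshot tooling rather than print strings):

```python
def backup_plan(db_path, peer_host, ebs_volume, s3_bucket):
    """Return the ordered steps of the three-tier backup (illustrative only)."""
    return [
        # Tier 1: replicate the instance-store DB to a second instance
        f"replicate {db_path} -> {peer_host}",
        # Tier 2: periodic dump of everything (even 'hosts') onto an EBS volume
        f"dump {db_path} -> {ebs_volume}",
        # Tier 3: snapshot that EBS volume to S3, which spans availability zones
        f"snapshot {ebs_volume} -> {s3_bucket}",
    ]

for step in backup_plan("/data/db", "standby-1", "vol-backup", "s3://offsite"):
    print(step)
```

The point of the ordering is that each tier survives a bigger failure than the last: an instance death, a machine death, and finally the loss of a whole data center.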

On a typical day, the AWS support forums are full of benign questions and a lot of pleading (since the majority of users do not pay for Premium Support, the one that comes with a phone number to call, many end up posting requests in the forums in the hope that an AWS engineer will pick up their request and do something about it; us lot can be pathetic, really). Well, not today: there seems to be an Amazonian revolution in the offing, with some calling for a class action and others throwing in the towel. Here’s a favorite:

I am honestly going to miss you. You were my first and only time in the cloud. You were so experienced, and I was so… fresh. We had some great times. I remember that one time when we stayed up all night learning how to create AMI’s. We had so much fun launching and terminating, except for that one time when the accidental termination policy wasn’t set. Man that really sucked. I know you didnt mean it, just like now I know you didn’t mean to walk out on me for almost 2 days, but you gotta understand. I have needs. I need to get back in the cloud and I need support. You know the kind where it’s included in the overpriced retail cost. The kind where I can call up a support agent intead of begging Luke to stop my instance in the forums. I know Luke probably has a bigger keyboard then me, but that’s not the point. The point is that I thought we had a connection. Like a real port 21 connection. I guess I should have gotten the picture when you first refused my connection two nights ago in the wee hours of the morning. I thought it was just because I was drunk but it turns out I was wrong. I guess you just aren’t interested in me anymore… And that’s ok. It’s my time to move on. I hope you don’t forget me, because I won’t forget you and your cute little EBS volumes that I loved to attach and detatch.

Farewell EC2, We had a good go. I’m switching DNS.

P.S. I want my Jovi records back.

It would be interesting to know what really happened this time, and what AWS will do to soothe our collective ‘apprehensions’ and regain our lost, or at least degraded, confidence. They can start by immediately dropping prices across the board (to stop migration out of AWS, or at least the halting of expansion on AWS services; I already know we’ll do the latter), since we don’t really know (nor care) what they will do to make sure this so-called network event does not happen again, because we already know: it will.