Archive for the ‘aws’ Category

Amazon just launched its tablet, the Kindle Fire. Aside from the price (it’s only $199, less than half the price of an iPad 2), one of its most interesting features is its browser, called Amazon Silk. The browser basically off-loads the heavy lifting of rendering and image optimization to Amazon’s huge proxy/rendering farm (courtesy of AWS). The result is snappier pages and happier users.

At least that’s the idea. There’s no doubt that infrastructure-wise this will work, as ISPs have done this at some point to save bandwidth and improve the user experience (Squid being the most popular open source cache/proxy).

However, it seems like Amazon’s engineers have pushed caching to the next level by executing CPU-hogging JavaScript and optimizing content (mainly image resizing) prior to delivery to the Kindle’s Silk browser. So far so good.

Now for the privacy questions: how can Amazon guarantee a) the protection and b) the anonymity of the session information and, most importantly, the data (e.g. usernames and passwords) that will be “proxied” by its servers?

How will the browser deal with HTTPS traffic? Will that be optimized too (i.e. go through their servers)? I hope not!

That being said, I’m looking forward to getting my hands on them fondleslabs =)

In a nutshell, they are saying that an engineer made a routing error during some maintenance at 1am (reminds me of the FAA’s air traffic controller problems), misrouting the traffic used by EBS for replication. When the EBS nodes could not replicate, they assumed that their backup peers were down and triggered an alternative backup mechanism, which drove the load even higher and caused other EBS nodes to begin backing up in panic mode. That’s the explanation.
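The feedback loop in that explanation can be sketched as a toy model. This is only an illustration of the described mechanism, not how EBS actually works: the function name, the node counts, the load units and the rule for how many extra nodes get stuck are all made-up assumptions.

```python
def simulate_remirror_storm(nodes, capacity, base_load, remirror_cost,
                            initially_stuck, steps):
    """Toy model of a replication storm (all parameters are hypothetical).

    A "stuck" node has lost contact with its replica and starts backing up
    elsewhere, which adds remirror_cost units of load to a shared network.
    Whenever total load exceeds capacity, some healthy nodes also lose
    contact with their replicas and become stuck in turn.
    """
    stuck = initially_stuck
    history = [stuck]
    for _ in range(steps):
        load = base_load + stuck * remirror_cost
        excess = load - capacity
        healthy = nodes - stuck
        if excess > 0 and healthy > 0:
            # Assumed rule: the share of healthy nodes that time out grows
            # with the relative overload, and at least one more node fails.
            newly_stuck = min(healthy, healthy * excess // capacity + 1)
            stuck += newly_stuck
        history.append(stuck)
    return history


if __name__ == "__main__":
    # Below the tipping point the network absorbs the extra backup traffic.
    print(simulate_remirror_storm(100, 100, 60, 5, initially_stuck=5, steps=4))
    # Above it, every newly stuck node adds load and drags others down.
    print(simulate_remirror_storm(100, 100, 60, 5, initially_stuck=10, steps=4))
```

With these made-up numbers, the first run stays flat at 5 stuck nodes, while the second climbs to all 100 nodes within three steps. The point is only that a small difference in the initial failure can decide between a non-event and a full cascade.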

Sounds logical enough, but I don’t buy it. Here’s why:

1) Any network operator worth their salt would have at least 2 routes for redundancy and would use some kind of IGP for automatic failover and load balancing. I can’t believe that, with all that talk of high reliability, they only have one link for replication and one so-called “control plane”, whatever that means.

2) I don’t see why a ‘routing issue’ can lose an EBS volume (i.e. render it unrecoverable) unless a high load can somehow cause a hard drive to fizzle out or write bad blocks. Very unlikely.

3) The fact that it took them more than 3 days to restore volumes shows that it is hardware related, meaning the primary devices, which were inexplicably rendered dead by a barrage of packets, had to be restored from backups. I wonder if the .07% of volumes that were unrecoverable were created just before or during the outage.

4) Then there’s the deafening silence of all the AWS evangelists, Jeff Bezos and all the other paid AWS bloggers. It’s not that everybody is singing the same tune: nobody’s singing at all!
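To make point 1 concrete, here is a minimal sketch of what a link-state IGP does conceptually: recompute the lowest-cost path whenever a link drops out. The topology, node names and costs are hypothetical, and a real IGP such as OSPF or IS-IS involves far more than a shortest-path calculation.

```python
import heapq


def shortest_path(links, src, dst):
    """Return (cost, path) for the cheapest route, or None if unreachable.

    links maps an undirected (node_a, node_b) pair to a link cost.
    """
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None


def without_link(links, failed):
    """Copy of the topology with one link removed (simulates a failure)."""
    return {link: cost for link, cost in links.items() if link != failed}


# Hypothetical topology: two storage nodes joined by a cheap primary
# path and a more expensive secondary path.
LINKS = {
    ("ebs-a", "primary"): 1,
    ("primary", "ebs-b"): 1,
    ("ebs-a", "secondary"): 5,
    ("secondary", "ebs-b"): 5,
}

if __name__ == "__main__":
    print(shortest_path(LINKS, "ebs-a", "ebs-b"))
    print(shortest_path(without_link(LINKS, ("primary", "ebs-b")), "ebs-a", "ebs-b"))
```

With the primary link gone, traffic simply takes the secondary path at a higher cost. Replication only stops if both paths are lost, which is why a single routing mistake taking everything down sounds odd.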

In conclusion, I don’t think this is a disaster triggered by a network event at all, but an inherent infrastructure/hardware design failure. For the sake of the many people who believe in AWS, I hope AWS learned its lessons really well and will make the necessary “fixes”.

7 days and still no post mortem as promised, and it turns out the so-called “network event” is actually a hardware failure resulting in some .07% of the EBS volumes being lost. Now they are talking about “the hardware” in the singular. What happened to the high reliability claims? All snake oil?

Here’s their letter of apology (from Bus. Int)
————————–
Hello,

A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful. The hardware failed in such a way that we could not forensically restore the data.

What we were able to recover has been made available via a snapshot, although the data is in such a state that it may have little to no utility…

If you have no need for this snapshot, please delete it to avoid incurring storage charges.

Well, it’s official. After 78 hours, Amazon has declared the emergency over:

7:35 PM PDT As we posted last night, EBS is now operating normally for all APIs and recovered EBS volumes. The vast majority of affected volumes have now been recovered. We’re in the process of contacting a limited number of customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes.

If you believe you are still having issues related to this event and we have not contacted you tonight, please contact us here. In the “Service” field, please select Amazon Elastic Compute Cloud. In the description field, please list the instance and volume IDs and describe the issue you’re experiencing.

We are digging deeply into the root causes of this event and will post a detailed post mortem.
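To put the 78 hours in perspective, here is a quick back-of-the-envelope calculation. It assumes a 365-day year and treats the whole 78 hours as downtime for an affected volume. The 99.95% target below is just an illustrative figure for comparison, not a claim about what any SLA says.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Availability over the period, as a percentage."""
    return 100.0 * (1 - downtime_hours / period_hours)


def downtime_budget_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed in the period for a given availability target."""
    return period_hours * (1 - target_percent / 100.0)


if __name__ == "__main__":
    print(round(availability_percent(78), 2))         # 78-hour outage
    print(round(downtime_budget_hours(99.95), 2))     # illustrative target
```

A volume that was down for the full 78 hours ends the year at about 99.1% availability at best, while a 99.95% target leaves a budget of under four and a half hours a year.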

With the dearth of information from AWS, and from past experience, we can only hazard a guess at what the AWS infrastructure looks like. However, the Great AWS Outage (not completely recovered as of this writing) gives us some ideas. This is a snapshot of the outage so far, showing that the majority of the API services were affected.

Pending the official post-mortem, here are a couple of possibilities:

All these services run on the same public EBS layer. When that failed, they all failed. This is the most likely reason, but how does it explain the Elastic Beanstalk API failing as well (and this does not seem to be region-centric)? Also, there could be a connection with the EBS failure on the 19th, which resulted in a much bigger problem 2 days later. From the status page (EC2 N. Virginia):

[RESOLVED]Increased error rates for Instance Import APIs in US-EAST-1

4:38 AM PDT Between 02:55 am and 04:20 am PDT, the Instance Import APIs in the US-EAST-1 Region experienced increased error rates. The issue has been resolved and the service is operating normally.

The US-EAST API infrastructure failed. Call it the Battle: Los Angeles scenario, where the weakest link of the invading aliens just happened to be their Command and Control (C&C). The movie sucks, btw, and this bunch of aliens obviously haven’t learned the lessons of their cousins from ‘V’ and ‘Independence Day’. This would explain Beanstalk failing as well, which means this one is not replicated to US-WEST or anywhere else.

Too bad, we were seriously considering migrating our databases to RDS and were just waiting for the beta bugs to be weeded out. What a relief.

OK, this will be my last post about the Great AWS Outage, which, from the posts in the support forums, still seems to be affecting a substantial number of users. Not quite the rosy picture AWS posted on its status board:

Apr 24, 2:06 PM PDT We continue to make steady progress on recovering the remaining affected EBS volumes. We are now working on reaching out directly to the small set of customers with one of the remaining volumes yet to be restored.

We’re getting better support and uptime with our lone Linode box. I suspect we’ll be moving some servers there after this disaster.