What other such incidents are you aware of that you believe created a large amount of havoc, affecting the largest number of users? What is to be learned from such incidents? How have those companies publicly responded to their downtime?


13 Answers

The Northeast Blackout of 2003 was a massive, widespread power outage that occurred throughout parts of the Northeastern and Midwestern United States and Ontario, Canada on Thursday, August 14, 2003, at approximately 4:15 p.m. EDT (UTC−4). At the time, it was the second most widespread electrical blackout in history, after the 1999 Southern Brazil blackout.[1][2] The blackout affected an estimated 10 million people in Ontario and 45 million people in eight U.S. states.

A software bug known as a race condition existed in General Electric Energy's Unix-based XA/21 energy management system. Once triggered, the bug stalled FirstEnergy's control room alarm system for over an hour. System operators were unaware of the malfunction; the failure deprived them of both audible and visual alerts for important changes in system state.[11][12][13] After the alarm system failure, unprocessed events queued up and the primary server failed within 30 minutes. Then all applications (including the stalled alarm system) were automatically transferred to the backup server, which itself failed at 14:54. The server failures slowed the screen refresh rate of the operators' computer consoles from 1–3 seconds to 59 seconds per screen. The lack of alarms led operators to dismiss a call from American Electric Power about the tripping and reclosure of a 345 kV shared line in northeast Ohio. Technical support informed control room personnel of the alarm system failure at 15:42.[14]
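The XA/21 source code is not public, but the general failure mode behind a race condition of this kind, several threads performing an unsynchronized read-modify-write on shared state, can be sketched in a few lines of Python. This is an illustrative toy, not GE's actual code; all names here are invented:

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_add(n):
    """Read-modify-write with no synchronization (the bug pattern)."""
    global counter
    for _ in range(n):
        v = counter        # read shared state
        counter = v + 1    # write back; another thread may have written in between

def safe_add(n):
    """The same loop, with a lock serializing the read-modify-write."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def run(worker, n=100_000, nthreads=4):
    """Reset the counter and run the worker on several threads."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(n,)) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run(safe_add))    # always 400000
print(run(unsafe_add))  # may be less than 400000: lost updates
```

As with the XA/21 bug, the unsafe version can run cleanly for a long time before the interleaving that loses an update ever occurs, which is what makes this class of bug so hard to catch in testing.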

My money is on Amazon, June 6th, 2008.
At approximately 10:25 a.m. Pacific time, the Amazon retail site became unreachable.
All other Amazon servers and services functioned properly; furthermore, HTTPS access to the site was still available.
The site was down for roughly two hours. Estimates put Amazon's potential lost revenue at $31,000 per minute, plus a good deal of credibility (Amazon's stock fell 2.7% that day).
The root cause is assumed to have been a faulty definition in the load-balancing layer, but no one from Amazon will confirm or deny it.

There was a three-hour Amazon S3 and EC2 outage in 2008 that affected thousands of websites, including Twitter (which used S3 for storage) and 37signals, for example. According to Amazon, it was due to scalability problems (ref link):

Here’s some additional detail about the problem we experienced earlier today.
Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.

As we said earlier today, though we're proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
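Amazon's first remediation item, monitoring the *proportion* of authenticated requests rather than only the total volume, can be sketched with a simple sliding-window ratio check. This is an illustrative sketch only; the class name, window size, and threshold are invented, not Amazon's actual tooling:

```python
from collections import deque

class AuthRatioMonitor:
    """Sliding-window monitor for the fraction of authenticated requests.

    Total volume can stay within normal ranges while the mix shifts toward
    expensive cryptographic calls; tracking the ratio catches that shift.
    """

    def __init__(self, window=1000, threshold=0.5):
        self.samples = deque(maxlen=window)   # 1 = authenticated, 0 = anonymous
        self.threshold = threshold

    def record(self, authenticated):
        """Record one request; return True if the ratio now warrants an alert."""
        self.samples.append(1 if authenticated else 0)
        ratio = sum(self.samples) / len(self.samples)
        return ratio > self.threshold

monitor = AuthRatioMonitor(window=10, threshold=0.5)
for _ in range(6):
    alert = monitor.record(True)    # six authenticated requests
for _ in range(4):
    alert = monitor.record(False)   # four anonymous requests
print(alert)  # True: 6/10 authenticated exceeds the 0.5 threshold
```

The point of the sketch is the metric, not the mechanism: a fixed request budget says nothing about per-request cost, so a capacity alarm keyed to the expensive request class fires earlier than one keyed to total traffic.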

Another Twitter incident, reported here, occurred on January 15, 2008, when the Mac world relied on the service during Steve Jobs' Macworld keynote and it succumbed to the load.

Most eyes in the tech world are currently on the Steve Jobs keynote at Macworld (detailed live updates here for you Apple fans). For those of us not in attendance, Twitter was presumed to be a good outlet to find out what's going on and discuss each twist and turn with our community. Alas, Twitter has once again crashed under a surge in traffic from Macworld, and has been largely inaccessible for the last hour.

In another incident, T-Mobile's entire mobile network in Germany failed for several hours. The failures started around 4 p.m. and were only resolved around 9–10 p.m. The outage affected most (possibly almost all) of T-Mobile's 40 million subscribers, who could not receive calls (though some could still make outgoing calls).

Almost as embarrassing as the outage was the compensation T-Mobile offered: subscribers could send SMS messages (normally €0.19 each) free of charge for one day (a Sunday). Business customers in particular certainly appreciated the gesture, which was thoughtfully restricted to a non-business day...