Tracking the 2003 Northeast Power Grid Blackout

A case study of the 2003 power blackout in the northeast US as tracked and monitored by a real-time power monitoring system.

Steve Taranovich at EDN has posted an article outlining the events leading up to the major power grid failure in the northeastern and mid-western US and parts of Canada in 2003. The article includes an actual timeline video and images of the 2003 outage as well as the sequence of events that the Genscape Real-Time North American Power Product (Power RT) captured, recorded, and identified as the blackout was happening.

A real-time power monitoring system can provide a visual and audio indication of where and when a generator has tripped offline, along with an estimate of the number of megawatts (MW) that have come offline, the approximate time of the event, and historical frequency-event data.
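
To give a sense of how such an alert might be derived, here is a minimal Python sketch that flags a suspected generator trip from the rate of change of grid frequency (ROCOF) and sizes the lost generation with the classic swing-equation relation ΔP ≈ (2·H·S/f0)·(df/dt). This is not Genscape's actual Power RT algorithm; the inertia constant, system base, and trip threshold are illustrative assumptions.

```python
# Minimal sketch of a generator-trip detector based on grid frequency.
# NOT Genscape's Power RT algorithm; the constants below are assumptions.

NOMINAL_HZ = 60.0          # North American nominal grid frequency
SYSTEM_INERTIA_H = 5.0     # assumed aggregate inertia constant (seconds)
SYSTEM_BASE_MVA = 100_000  # assumed interconnection MVA base
ROCOF_TRIP = -0.02         # Hz/s; assumed threshold for flagging a trip

def rocof(samples, dt):
    """Rate of change of frequency (Hz/s) from the two newest samples."""
    return (samples[-1] - samples[-2]) / dt

def estimate_mw_lost(df_dt):
    """Swing-equation estimate: dP ~= (2 * H * S / f0) * df/dt."""
    return abs(2 * SYSTEM_INERTIA_H * SYSTEM_BASE_MVA / NOMINAL_HZ * df_dt)

def check_for_trip(freq_samples, dt=1.0):
    """Flag a suspected generator trip and size it from the frequency dip."""
    df_dt = rocof(freq_samples, dt)
    if df_dt <= ROCOF_TRIP:
        print(f"ALERT: possible generator trip, "
              f"~{estimate_mw_lost(df_dt):,.0f} MW lost")

# Example: frequency sags from 60.00 to 59.96 Hz over one second.
check_for_trip([60.00, 60.00, 59.96])
```

A production monitor would filter noise over many samples and cross-check multiple sensors before alarming, but the basic physics is the same: a sudden loss of generation shows up as a measurable dip in system frequency.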

Genscape’s real-time monitors detected the cascade of the 2003 blackout with the loss of Homer City.

From the article:

The blackout's primary cause was a software bug in the alarm system at a control room of the FirstEnergy Corporation in Ohio. Operators were unaware of the need to re-distribute power after overloaded transmission lines hit unpruned foliage. What should have been a manageable local blackout cascaded into widespread chaos on the electric grid.

We will show how such a system [as Genscape's Power RT] alerts users to what and where problems are beginning to crop up and perhaps avoid such catastrophic events in the future.

"Genscape's proprietary power monitors .....detected the blackout cascade with the loss of Homer City."

There wouldn't have been a certain Mr Simpson involved in this, would there?

On a more serious note, I remember a blackout we had once in Zimbabwe that affected most of the country and, I think, parts of the surrounding countries. The whole grid went unstable and the generators were loaded down until they almost stopped. I remember the fluorescent lights flashing slower and slower until they got down to around one flash a second, and I knew something was seriously wrong. I think it took over a day to get it all back on again.

Here in New England, we often lose power during the winter due to snow, ice, and tree damage to the power lines. Even during the summer there is the all-too-frequent car accident taking out a power pole. One thing I noticed a number of years ago was that the NE power companies did not seem to be doing line maintenance: trimming back overhanging branches and cutting down failing trees, and as a result we had many more outages. It seemed they would rather wait for some nasty weather to do the trimming for them, only to have to send their poor crews out in the ugliest conditions. Recently they came to their senses and started trimming and clearing the overhanging branches, and now we enjoy much better and more reliable power! This was a business decision, not operator error (except of course cars hitting poles!), that contributed to both the number and duration of power outages.

This is so very true in so many areas. I can't count the number of times I've dealt with programmers and web designers who couldn't wrap their heads around the fact that the end user shouldn't have to conform to their specific and peculiar way of doing things. When the majority of people find an issue in your system, it doesn't matter if you think it is correct... it isn't.

I understand the huge impact of this kind of massive power failure during winter (especially during a snowstorm) there in Canada and North America, because it is a very rare event and people are usually not well prepared for such scenarios.

On a lighter note... I live in the southern part of India, where winter is very pleasant and we can survive without power for a few days. Moreover, we are used to experiencing regular power cuts :) (due to the supply-demand gap).

Ah yes, DrFPGA: with more sophisticated technology come more creative hackers who love the challenge of defeating the security of a system. This will surely be one of the most daunting tasks we face as we enter this next phase of a smarter grid.

To me the big issue going forward isn't reliability so much as security. With the move toward automated metering and control, a 'hack' that tricks the system into thinking something is going wrong seems a much more important potential root cause of future failures...

In my vast experience working quality problems, if you are blaming the operator or training or documentation, you have failed to understand the problem or take corrective action seriously. It's usually a cover-up for inadequate procedures or systems. If you have designed the procedures and systems correctly, they are resilient to non-malicious operator actions.

Blaming the operator, documentation or training, especially first and only, is the hallmark of an immature quality & reliability approach. It's a cop-out to fix the blame instead of fix the problem.

Like most aircraft accidents and Three Mile Island, it was an accumulation of a bunch of "little" errors and a loss of situational awareness by the operators that led to it. Behind it was also a poorly managed energy provider, what they refer to in 8D problem solving as the root cause of the root cause.