Don't shoot the developer

You probably recall the big blackout that hit the northeastern United States
last August 14th. This week the North American Electric Reliability Council
released a report with the rather imposing title "August
14, 2003 Blackout: NERC Actions to Prevent and Mitigate the Impacts of Future
Cascading Blackouts February 10, 2004". In the report, NERC provides a whole
mess of things that should be done to make the electrical system more reliable,
and orders specific corrective actions to be taken by FirstEnergy, Midwest
Independent System Operator, and PJM, the three entities who seem to be most
responsible for the events that launched the blackout.

Based on this report, some of the press is pointing fingers at the folks who
developed GE Energy's XA/21 System for monitoring and managing electrical power
systems. FirstEnergy is ordered to replace the XA/21 system, and in the
meantime:
"Until the current energy management system is replaced, FE shall incorporate
all fixes for the GE XA21 system known to be necessary to assure reliable and
stable operation of critical reliability functions, and particularly to correct
the alarm processor failure that occurred on August 14, 2003."

At this point things get a bit fuzzy, because various final reports aren't
available yet, but it appears that an unpatched bug in the XA/21 system in
August led to FirstEnergy not receiving alarms that things were wrong in their
system. Aha, think some people, it's that darned software breaking down again
that caused us to have to cancel our plans on that hot August night.

Comforting though it is to blame the software, I think in this case it's
unjustified, or at best unproven. There are several places we might look before
throwing the entire blame on GE's shoulders. For starters, unpatched software is
hardly the only thing that FirstEnergy got dinged for by NERC. They didn't train
their operators properly, they didn't bother to notify other systems when things
started to go south (a violation of existing NERC policies), and they didn't
even bother to generate electrical power that was up to national standards, or
to follow their own rules for how much power they should be generating. They
also get criticized for an ineffective "vegetation
management" program; apparently the proximate cause of the problems was not
computer failures, but tree limbs boinking into lines carrying 345 kV. And these
were "persistent problems" before August 14. The picture that emerges is one of
a pretty slipshod utility where pretty much everything was deteriorating.

Second, if the XA/21 system is so bad, how come we're not seeing it fail all
over the place? I couldn't find any sales figures on GE Energy's Web site
(hardly surprising), but given that they've got a dozen training courses
scheduled in Florida between now and June I assume they're selling a few copies.
Clearly there must have been some special circumstance at FirstEnergy to cause
it to fail the way that it did, and who knows yet whether it was a coding fault,
setup error, operator error, or what.

Finally, it seems like a violation of just plain good sense to use the same
monitoring software on both the main and backup monitoring systems (if in fact
that's what was done; it's a bit hard to tell from the information that has been
released so far). Maybe there aren't any other good pieces of software in the
market, but surely different software on the backup system would have been a
more robust way to set things up.

If there's anything clear about the August 14 blackout, it's that it was a
complex system failure. Software, training, and even tree limbs all played their
part. So why do people focus in on software as "the cause"? Perhaps it's just
the natural reaction to too many crashes on their home PCs. Whatever the case,
let's not jump to conclusions here, or try to simplify complex failures into a
single line of code.

About the Author

Mike Gunderloy has been developing software for a quarter-century now, and writing about it for nearly as long. He walked away from a .NET development career in 2006 and has been a happy Rails user ever since. Mike blogs at A Fresh Cup.