Quality minefield in a world where “everything is software!” – Looking back at the 2003 summer blackout

Do you remember where you were on Aug. 14, 2003?

It feels like a long time ago, but I was reminded of what it was like back then. In fact, a number of newspaper articles highlight the drama that unfolded on that day:

http://www.flickr.com/photos/55976115@N00/2761841860/
“A mural on Dundas Street West reminding us of the night we all came together in the City, the night without Power – August 14, 2003.”

At the time, I worked at a start-up in the west end of Toronto, close to Pearson International Airport. We were a small, but mighty, team of about a dozen developers who had embraced Agile development and particularly eXtreme Programming (XP). Quality wasn’t an afterthought. We believed in collective ownership of quality. Of course, we couldn’t live without TDD (Test-Driven Development) and CI (Continuous Integration). We were proud of our tens of thousands of unit tests. At times, our interview and hiring process even included pair programming.

My story for that day is similar to that of millions of others in the city and around North America who suddenly got a taste of a complete blackout. I remember having to walk along Bloor St. with probably tens of thousands of commuters since the entire subway line wasn’t working. I remember lots of people helping out with the traffic and the chaos. A number of convenience stores along the way were giving away their ice cream for free. It turned out it wasn’t just a local problem. For example, in New York, the Restaurant Association estimated a loss of between $75 million and $100 million in discarded food and lost business.1 It took me about 3 hours of walking to get home in downtown Toronto. I think I walked about 20 km (or about 12 miles). My greatest inconvenience on that day was probably the shoes I had – my feet were sore at the end of the day.

There were many others who had a more dramatic experience on that day. For example, there’s “the blackout baby”: his mother, Cara O’Neill, had to rush to the hospital on that very hot summer day after her water broke while the power was off throughout Norwalk, Connecticut!

Unfortunately, others had it much worse. There were a number of deadly fires. In New York, a 72-year-old died of smoke inhalation from a fire caused by a burning candle. Wikipedia lists the blackout as contributing to eleven deaths.

In total, roughly 50 million people were impacted. In the summer heat, the loss of power lasted up to four days, with an estimated cost of $6 billion USD.2 And it all got started with a tree branch touching a power line in Ohio! How did that cause such a domino effect with catastrophic consequences?

Trees and Power lines
(Source: https://www.sce.com/SC3/Safety/treesandpower/)

As power stations went offline, the alarm systems that were meant to track and monitor blackouts either failed or didn’t work as expected. At the heart of the failure was an “Alarm and Event Processing Routine” written in C/C++. With approximately one million lines of code, it took weeks of debugging to reproduce the problem and identify the culprit as a “race condition” in the code – unfortunately, a common yet very tricky programming error to notice. I remember using JMock a long time ago to test multi-threaded code. I think TDD allows the design, or the shape, of the code to evolve with testability in mind. Otherwise, I think it’s very difficult to test such scenarios. Here’s how Mike Unum, manager of commercial solutions at GE Energy, who worked at the company’s Florida laboratory to figure out what went wrong, described it:3

“There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time. And that corruption lead to the alarm event application getting into an infinite loop and spinning.”
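
To make that quote a little more concrete, here is a minimal sketch in Python of the general idea: several writers updating one shared data structure, with a lock making sure only one of them has write access at a time. This is purely my own illustration, not the original C/C++ and not GE Energy's code. The names `AlarmLog`, `record`, `snapshot`, and `run_writers` are hypothetical.

```python
import threading


class AlarmLog:
    """Hypothetical shared alarm structure (illustration only)."""

    def __init__(self):
        self._events = []
        self._count = 0
        self._lock = threading.Lock()

    def record(self, event):
        # The list append and the counter update belong together.
        # Holding the lock means no other writer can interleave
        # between the two steps and leave them out of sync.
        with self._lock:
            self._events.append(event)
            self._count = self._count + 1

    def snapshot(self):
        # Readers take the same lock so they never see a half-applied update.
        with self._lock:
            return self._count, list(self._events)


def run_writers(log, writers=4, events_per_writer=1000):
    """Start several writer threads against one log and wait for them all."""

    def work(writer_id):
        for i in range(events_per_writer):
            log.record((writer_id, i))

    threads = [threading.Thread(target=work, args=(w,)) for w in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log.snapshot()


count, events = run_writers(AlarmLog())
print(count, len(events))  # the counter and the list always agree
```

Without the `with self._lock:` line in `record`, two writers could be inside the update at the same time, which is the kind of situation the quote describes. The nasty part is that such a bug may only show up once in a very long while, which is presumably why it took weeks to reproduce.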

The Energy Library has a detailed sequence of events of the cascading effect that “ultimately forced the shutdown of more than 100 power plants”.

Looking back at the events of that day, and reading about how a software failure in a small alarm monitoring routine caused so much havoc, reminded me of a talk I attended by futurist and author Jim Carroll, the keynote speaker at the Toronto Agile Tour 2012. He talked about how our dependency on software is growing exponentially, to the point where everything is becoming software. I believe Agile development has its roots in sound engineering practices. I’m glad to have been exposed to that early on and to have been part of a great team – we ended up walking towards downtown most of the way together on that hot summer day in August!

If you have enjoyed reading this post, I have a follow-up: Quality minefield in a world where “everything is software!” – Our future

Sources:

  1. World Socialist Web Site: US: Impact of Northeast blackout continues to emerge
  2. Scientific American: The 2003 Northeast Blackout–Five Years Later
  3. Tracking the Blackout bug
