The worst software bugs in history (and how to learn from them)
Wired’s article on History’s worst software bugs is a short and darkly entertaining read that makes us feel better about all the comparatively benign defects we’ve all shipped out to the world. We’d never make anything that bad, right? Wrong.
The difference between these stories and 90% of software developers is the context of the work. Few of us work on medical equipment, anti-lock brakes, or nuclear weapon arming devices. We don’t work on things with the potential to kill or to cost hundreds of millions of dollars. For most of us, if we employed the same development practices we use on a daily basis on a mission-critical project, we’d make this list in no time. The difference between us and them isn’t skill, it’s domain.
Another problem with articles like this one is that they offer little to learn from. It’s too easy to laugh at how stupid the mistakes seem and gleefully return to writing code. These lists tend to give us unfounded confidence that our approaches to work make us immune from these kinds of mistakes, but that’s not true.
Worse, unlike other kinds of disasters, such as airplane crashes, medical errors, or building collapses, few of these worst bugs in history have publicly available analyses of why they happened and how we can avoid similar mistakes.
Learning from the past is not a strong part of practitioner software development culture, and that’s a shame, since we repeat the same mistakes again and again. Understanding landmark failures is an integral part of most engineering disciplines (see Petroski’s To Engineer Is Human: The Role of Failure in Successful Design). It’s not yet part of software development culture, but it needs to be.
It’d be great for every CS major to study one of these software disasters before graduating, and to understand something about how failures really happen before they start building things as professionals. Back at CMU, the only course I took on engineering failures was in the humanities school (and it may have been the best course I’ve ever taken).
Here are some resources for learning from the mistakes of others:
- Risks list. A monthly catalog of reminders of how fragile technology is. People submit stories of potential mistakes and recent failures with emphasis on understanding and avoidance. Cuts across many technologies and businesses.
- Why software fails. My favorite essay on the subject: distills lessons out of a series of real-world case studies.
- Digital woes. A book of 13 stories of software disasters, written for a lay audience. It frames each disaster in a social and engineering context and offers recommendations for people starting new projects. Highly recommended: an easy read.
- Fatal defect. Similar to Digital woes but with different examples. Also well written.
- Primer on software testing. I’ve come to hate the word “testing,” as it implies you fix software only after it’s written – but this summary of techniques is a great refresher for anyone who writes code. It’s a reminder of how much is known about making quality software that most shops never even think to put into practice.
Anyone have any references I should add? How do you learn from the failures of others? Your own?
Lots of nice references here. I think that now, 12 years after this post, you might add Gojko Adzic’s “Humans vs Computers” https://gojko.net/books/humans-vs-computers/