Failure Is An Option

Failure is a word that, understandably, carries a negative connotation. Nobody wants to fail, really. But failure, if you’re doing anything worthwhile, is inevitable. What’s important is to plan for failure, learn from it, try to avoid damage and do your best to recover gracefully. That was the topic of Selena Deckelmann’s keynote, “Mistakes Were Made,” Sunday morning at the Southern California Linux Expo (SCALE).

Deckelmann is founder and COO of Prime Radiant, the company behind Checkmarkable, “a product that helps organizations document, share and tweak their processes.” Deckelmann is also a longtime contributor to the PostgreSQL project.

As a longtime member of the open source community, Deckelmann has seen plenty of public failure. One of the purposes of open source, says Deckelmann, is to “teach the world to fail.” That’s not to say that open source is a failure. The point is that failure happens, and in open source projects it happens in full view. Deckelmann says that open source communities are “experts at studying failure, collaboratively.”

Stop, Drop and Roll

The important thing is to plan for when things fail, not if things fail. As we all know, mistakes will be made and failures will happen. It’s how the failures are dealt with that really makes a difference. Wanting a memorable phrase, Deckelmann says that she was reminded of “stop, drop, and roll.” But that might not be terribly useful for tech organizations dealing with project failures. Instead, Deckelmann suggests a five-point plan:

Document
Test
Verify
Imagine
Implement

Documentation

Documentation is the first area where many organizations fall down. Deckelmann stressed that documentation is an important step in any process, but it’s often overlooked. Even when documentation exists, there’s often not a process for ensuring that it is updated. Deckelmann says that organizations need to ensure that documentation is a step in written processes, and that a fixed amount of time should be assigned to it.

Testing

Next is testing. Here, Deckelmann was largely working within the context of IT projects, such as a database migration or moving servers from one datacenter to another. For successful testing, you need to start by verifying the success criteria: you need an accurate picture of success before you can create tests. The next step is to write the tests, then conduct testing with a “buddy” so that at least two people are involved. Deckelmann also stressed the importance of a staging environment, when possible.

Verify

Verification is another important step in testing any project plan. Deckelmann gave an example of a database migration using PostgreSQL 8.3. A test plan was written and the team had repeatable shell scripts to use for the project, which had been tested on some test data (but not production data). Unfortunately, they hadn’t verified that they were using the right option for the PostgreSQL dump tool, so what should have been a very short dump from the database turned into a 12-hour (or longer) process that ran well past the projected maintenance window. Oops.

This also underscores the importance of having a plan for when things go wrong, and of testing the rollback plan as well as the implementation plan.
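One way to catch a surprise like this before the maintenance window is to time each step during a rehearsal and compare a projection against the window. The following is a minimal Python sketch, not something shown in the talk: the function names, the assumption that run time scales roughly linearly with row count, the safety factor and the numbers are all hypothetical.

```python
from datetime import timedelta


def projected_duration(rehearsal: timedelta, rehearsal_rows: int,
                       production_rows: int) -> timedelta:
    """Scale a rehearsal timing up to production size.

    Assumes cost grows roughly linearly with row count, which is only
    a rough guess and should itself be verified.
    """
    if rehearsal_rows <= 0:
        raise ValueError("rehearsal_rows must be positive")
    return rehearsal * (production_rows / rehearsal_rows)


def fits_window(projected: timedelta, window: timedelta,
                safety_factor: float = 2.0) -> bool:
    """Return True only if the step fits with headroom left over."""
    return projected * safety_factor <= window


# Hypothetical numbers: a 3-minute rehearsal on 1M rows, 40M rows in production.
projected = projected_duration(timedelta(minutes=3), 1_000_000, 40_000_000)
print(projected, fits_window(projected, timedelta(hours=4)))
```

A projection like this is no substitute for rehearsing on production-sized data, but it makes the gap between "worked on test data" and "fits the window" explicit.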

Imagine

Failure does not come in only one flavor. You have to be very creative to think of all the ways that a plan could fail. Start with the most obvious, but also consider the likely ways that something could fail that might not be obvious. Deckelmann recommended sharing stories of failure with others, talking to people outside your field and acting out different implementation scenarios in any plan.

Implement

Finally, Deckelmann talked about actual implementation of a plan. Again, she emphasized teamwork, suggesting a paired approach when working through plans like a migration or code rollout. She suggested that teams working remotely share screens, use a chatroom and (optimally) voice communication.

She also recommended that teams designate a timekeeper and agree in advance on the point at which a project is considered to have “failed” and the rollback plan should start.
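The timekeeper’s job can be reduced to a single agreed deadline: the latest moment at which a rollback can begin and still finish inside the window. Here is a small Python sketch of that idea, assuming the team has an estimate of how long rollback takes. The `Timekeeper` class and its parameters are hypothetical illustrations, not something from the talk.

```python
from datetime import datetime, timedelta


class Timekeeper:
    """Tracks a go/no-go deadline agreed before the work starts."""

    def __init__(self, start: datetime, window: timedelta,
                 rollback_estimate: timedelta):
        if rollback_estimate >= window:
            raise ValueError("rollback must fit inside the window")
        self.window_end = start + window
        # Latest moment at which rollback can begin and still finish in time.
        self.rollback_deadline = self.window_end - rollback_estimate

    def should_roll_back(self, now: datetime, finished: bool) -> bool:
        """True once the deadline has passed and the work is not finished."""
        return not finished and now >= self.rollback_deadline


# Hypothetical 4-hour window starting at 02:00, with a 1-hour rollback estimate.
keeper = Timekeeper(datetime(2024, 1, 1, 2, 0), timedelta(hours=4),
                    timedelta(hours=1))
print(keeper.rollback_deadline)
```

Deciding the deadline up front means nobody has to argue, mid-migration, about whether to keep going.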

Post-Mortem

Deckelmann also stressed the importance of a post-mortem for projects. The post-mortem, says Deckelmann, should have an expectation of 100% participation, not just from the team but also from the clients or users affected by the project. Note that teams should hold a post-mortem after every project, not only after one that has actually failed. Even a successful project may call for improvement.

But don’t try to improve everything at once. Deckelmann recommended that teams limit improvements to one or two things, rather than a laundry list of things to improve. It’s far easier to achieve improvements in small chunks.

During a post-mortem, Deckelmann says that teams should have a note-keeper and a timekeeper, and every participant should share a success and a failure.

Success

Overall, I felt like Deckelmann’s talk was a big success. Many of the suggestions she had are common sense, but it’s not at all uncommon to work with teams that don’t have detailed plans or conduct any kind of post-mortem after a project fails. Instead of treating failure as something to be ashamed of, teams should plan for failure and learn to minimize the impact, learn from the mistakes, and move on without repeating them.