Chaos Monkey: How Netflix Uses Random Failure to Ensure Success

In a post last week about lessons learned using Amazon Web Services, Netflix’s John Ciancutti revealed that the company built something called “Chaos Monkey” to ensure that individual components work independently. Chaos Monkey randomly kills instances and services within Netflix’s AWS infrastructure so that developers can verify that each component still returns something even when the systems it depends on aren’t responding.
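The idea of killing instances at random can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Netflix’s implementation: the Instance class, the unleash_monkey function, and the kill_probability parameter are hypothetical names invented for this example, and nothing here talks to AWS.

```python
import random


class Instance:
    """Hypothetical stand-in for a running server instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True


def unleash_monkey(instances, kill_probability, rng=None):
    """Mark each live instance as dead with the given probability.

    Returns the names of the instances killed on this pass.
    kill_probability and rng are assumptions made for this sketch.
    """
    rng = rng or random.Random()
    killed = []
    for instance in instances:
        # rng.random() is in [0.0, 1.0), so 0.0 never kills and 1.0 always kills.
        if instance.alive and rng.random() < kill_probability:
            instance.alive = False
            killed.append(instance.name)
    return killed


if __name__ == "__main__":
    fleet = [Instance(f"service-{i}") for i in range(5)]
    victims = unleash_monkey(fleet, kill_probability=0.2)
    print("Killed this pass:", victims)
```

Running a pass like this on a schedule, then checking that the rest of the system still answers requests, is the kind of constant testing the post describes.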

For example, if the recommendation system is down, Netflix will display popular titles instead of personalized picks. The quality of the response is degraded, but at least there is a response. Ciancutti explains it this way: “If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
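The fallback described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical personalized_picks call that raises an error when the recommendation system is unreachable; the function names, the POPULAR_TITLES list, and the service_up flag are invented for the example and are not taken from Netflix’s code.

```python
# Hypothetical placeholder data; real titles would come from another service.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


class RecommendationUnavailable(Exception):
    """Raised when the (hypothetical) recommendation system does not respond."""


def personalized_picks(user_id, service_up=True):
    """Stand-in for a call to the recommendation system.

    The service_up flag simulates an outage for this sketch.
    """
    if not service_up:
        raise RecommendationUnavailable("recommendation system is down")
    return [f"Pick {n} for {user_id}" for n in range(1, 4)]


def titles_for_home_page(user_id, service_up=True):
    """Return personalized picks, or popular titles if the dependency fails."""
    try:
        return {"titles": personalized_picks(user_id, service_up), "degraded": False}
    except RecommendationUnavailable:
        # Degraded but still useful: the caller always gets a response.
        return {"titles": list(POPULAR_TITLES), "degraded": True}


if __name__ == "__main__":
    print(titles_for_home_page("user-42", service_up=True))
    print(titles_for_home_page("user-42", service_up=False))
```

The point of the design is that the caller never sees the failure directly: it receives a lower-quality answer along with a flag saying so.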

Here are the lessons Ciancutti writes that Netflix has learned:

  1. Dorothy, you’re not in Kansas anymore (“You need to be prepared to unlearn a lot of what you know”)
  2. Co-tenancy is hard
  3. The best way to avoid failure is to fail constantly
  4. Learn with real scale, not toy models
  5. Commit yourself

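Chaos Monkey is an application of lesson three: by failing constantly and on purpose, Netflix finds out whether its components cope before a real outage does it for them.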

For more advice on migrating to the cloud from Netflix, check out our article Netflix’s Advice on Moving to Amazon Web Services.
